* blk-mq: allow passing in an external queue mapping V2
@ 2016-08-29 10:53 ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

This series is one third of the earlier "automatic interrupt affinity for
MSI/MSI-X capable devices" series, and makes use of the new irq-level
interrupt / queue mapping code in blk-mq, as well as allowing the driver
to pass in such a mask obtained from the (PCI) interrupt code.  To fully
support this feature in drivers, the final third, which lives in the PCI
layer, will be needed as well.

Note that these patches are on top of Linux 4.8-rc4 and will need several
patches not yet in the block for-4.9 branch.

A git tree is available at:

   git://git.infradead.org/users/hch/block.git block-queue-mapping

Gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-queue-mapping

Changes since V1:
 - rebased on top of Linux 4.8-rc4

Changes since "automatic interrupt affinity for MSI/MSI-X capable devices" V3:
 - a trivial cleanup in blk_mq_create_mq_map pointed out by Alexander
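
For reference, a minimal driver-side sketch of how the mask would be handed
to blk-mq through the new tag_set affinity_mask field added in patch 4.  The
device structure, the ops and the way the irq-derived mask is obtained are
made up for illustration; the real helper for the latter belongs to the
separate PCI series:

	static int example_init_tag_set(struct example_dev *dev,
			struct cpumask *irq_affinity_mask)
	{
		struct blk_mq_tag_set *set = &dev->tag_set;

		memset(set, 0, sizeof(*set));
		set->ops		= &example_mq_ops;
		set->nr_hw_queues	= dev->nr_vectors;
		set->queue_depth	= 64;
		set->numa_node		= NUMA_NO_NODE;
		/* new in this series: spread queues along the irq affinity mask */
		set->affinity_mask	= irq_affinity_mask;
		/* leaving this NULL falls back to irq_create_affinity_mask() */

		return blk_mq_alloc_tag_set(set);
	}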


* [PATCH 1/7] blk-mq: don't redistribute hardware queues on a CPU hotplug event
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-29 10:53   ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

Currently blk-mq will completely remap the hardware contexts when a CPU
hotplug event happens, which causes major havoc for drivers, as they are
never told about this remapping.  E.g. any carefully sorted out CPU
affinity will just be completely messed up.

The rebuild also doesn't really help for the common case of cpu
hotplug, which is soft onlining / offlining of cpus - in this case we
should just leave the queue and irq mapping as is.  If it actually
worked it would have helped in the case of physical cpu hotplug,
although for that we'd need a way to actually notify the driver.
Note that drivers may already be able to accommodate such a topology
change on their own, e.g. using the reset_controller sysfs file in NVMe
will cause the driver to get things right for this case.

With the rebuild removed we will simply retain the queue mapping for
a soft offlined CPU, which will keep working when it comes back online, and
will map any newly onlined CPU to queue 0 until the driver initiates
a rebuild of the queue map.
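
The "queue 0" fallback follows from how the map is allocated and looked up;
a condensed sketch of the two pieces involved, as they end up looking later
in this series (nothing here beyond what the other patches add):

	/* the per-CPU map is zero-initialized, so unassigned CPUs stay on 0 */
	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
			GFP_KERNEL, set->numa_node);

	/* the lookup is a plain array index into that map */
	static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
			int cpu)
	{
		return q->queue_hw_ctx[q->mq_map[cpu]];
	}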

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 13f5a6c..b29e7b2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2147,8 +2147,6 @@ static void blk_mq_queue_reinit(struct request_queue *q,
 
 	blk_mq_sysfs_unregister(q);
 
-	blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, online_mask);
-
 	/*
 	 * redo blk_mq_init_cpu_queues and blk_mq_init_hw_queues. FIXME: maybe
 	 * we should change hctx numa_node according to new topology (this
-- 
2.1.4


* [PATCH 2/7] blk-mq: only allocate a single mq_map per tag_set
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-29 10:53   ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

The mapping is identical for all queues in a tag_set, so stop wasting
memory by building multiple copies.  Note that for now I've kept the mq_map
pointer in the request_queue, but we'll need to investigate whether we can
remove it without suffering too much from the additional pointer chasing.
The same would apply to the mq_ops pointer as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c         | 22 ++++++++++++++--------
 include/linux/blk-mq.h |  1 +
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b29e7b2..15c36c1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1963,7 +1963,6 @@ void blk_mq_release(struct request_queue *q)
 		kfree(hctx);
 	}
 
-	kfree(q->mq_map);
 	q->mq_map = NULL;
 
 	kfree(q->queue_hw_ctx);
@@ -2062,9 +2061,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
 
-	q->mq_map = blk_mq_make_queue_map(set);
-	if (!q->mq_map)
-		goto err_map;
+	q->mq_map = set->mq_map;
 
 	blk_mq_realloc_hw_ctxs(set, q);
 	if (!q->nr_hw_queues)
@@ -2114,8 +2111,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	return q;
 
 err_hctxs:
-	kfree(q->mq_map);
-err_map:
 	kfree(q->queue_hw_ctx);
 err_percpu:
 	free_percpu(q->queue_ctx);
@@ -2337,14 +2332,22 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
+	set->mq_map = blk_mq_make_queue_map(set);
+	if (!set->mq_map)
+		goto out_free_tags;
+
 	if (blk_mq_alloc_rq_maps(set))
-		goto enomem;
+		goto out_free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
-enomem:
+
+out_free_mq_map:
+	kfree(set->mq_map);
+	set->mq_map = NULL;
+out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
 	return -ENOMEM;
@@ -2360,6 +2363,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 			blk_mq_free_rq_map(set, set->tags[i], i);
 	}
 
+	kfree(set->mq_map);
+	set->mq_map = NULL;
+
 	kfree(set->tags);
 	set->tags = NULL;
 }
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e43bbff..a572227 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -65,6 +65,7 @@ struct blk_mq_hw_ctx {
 };
 
 struct blk_mq_tag_set {
+	unsigned int		*mq_map;
 	struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
-- 
2.1.4


* [PATCH 3/7] blk-mq: remove ->map_queue
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-29 10:53   ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

All drivers use the default, so provide an inline version of it.  If we
ever need another queue mapping we can add an optional method back,
although supporting that will also require major changes to the queue
setup code.

This provides better code generation, and better debuggability as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-flush.c                 |  6 +++---
 block/blk-mq-tag.c                |  5 ++---
 block/blk-mq.c                    | 40 +++++++++++----------------------------
 block/blk-mq.h                    |  6 ++++++
 block/blk.h                       | 11 +++--------
 drivers/block/loop.c              |  1 -
 drivers/block/mtip32xx/mtip32xx.c |  1 -
 drivers/block/null_blk.c          |  1 -
 drivers/block/rbd.c               |  1 -
 drivers/block/virtio_blk.c        |  1 -
 drivers/block/xen-blkfront.c      |  1 -
 drivers/md/dm-rq.c                |  1 -
 drivers/mtd/ubi/block.c           |  1 -
 drivers/nvme/host/pci.c           |  2 --
 drivers/nvme/host/rdma.c          |  2 --
 drivers/nvme/target/loop.c        |  2 --
 drivers/scsi/scsi_lib.c           |  1 -
 include/linux/blk-mq.h            |  7 -------
 18 files changed, 25 insertions(+), 65 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index d308def8..6a14b68 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -232,7 +232,7 @@ static void flush_end_io(struct request *flush_rq, int error)
 
 		/* release the tag's ownership to the req cloned from */
 		spin_lock_irqsave(&fq->mq_flush_lock, flags);
-		hctx = q->mq_ops->map_queue(q, flush_rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
 		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
 		flush_rq->tag = -1;
 	}
@@ -325,7 +325,7 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
 		flush_rq->tag = first_rq->tag;
 		fq->orig_rq = first_rq;
 
-		hctx = q->mq_ops->map_queue(q, first_rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
 		blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
 	}
 
@@ -358,7 +358,7 @@ static void mq_flush_data_end_io(struct request *rq, int error)
 	unsigned long flags;
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	/*
 	 * After populating an empty queue, kick it to avoid stall.  Read
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 729bac3..1602813 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -301,8 +301,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 		io_schedule();
 
 		data->ctx = blk_mq_get_ctx(data->q);
-		data->hctx = data->q->mq_ops->map_queue(data->q,
-				data->ctx->cpu);
+		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
 		if (data->flags & BLK_MQ_REQ_RESERVED) {
 			bt = &data->hctx->tags->breserved_tags;
 		} else {
@@ -726,7 +725,7 @@ u32 blk_mq_unique_tag(struct request *rq)
 	int hwq = 0;
 
 	if (q->mq_ops) {
-		hctx = q->mq_ops->map_queue(q, rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 		hwq = hctx->queue_num;
 	}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 15c36c1..434df39 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -244,7 +244,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		return ERR_PTR(ret);
 
 	ctx = blk_mq_get_ctx(q);
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 
 	rq = __blk_mq_alloc_request(&alloc_data, rw, 0);
@@ -253,7 +253,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		blk_mq_put_ctx(ctx);
 
 		ctx = blk_mq_get_ctx(q);
-		hctx = q->mq_ops->map_queue(q, ctx->cpu);
+		hctx = blk_mq_map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 		rq =  __blk_mq_alloc_request(&alloc_data, rw, 0);
 		ctx = alloc_data.ctx;
@@ -337,11 +337,7 @@ EXPORT_SYMBOL_GPL(blk_mq_free_hctx_request);
 
 void blk_mq_free_request(struct request *rq)
 {
-	struct blk_mq_hw_ctx *hctx;
-	struct request_queue *q = rq->q;
-
-	hctx = q->mq_ops->map_queue(q, rq->mq_ctx->cpu);
-	blk_mq_free_hctx_request(hctx, rq);
+	blk_mq_free_hctx_request(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu), rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
 
@@ -1064,9 +1060,7 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
 {
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx;
-
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	spin_lock(&ctx->lock);
 	__blk_mq_insert_request(hctx, rq, at_head);
@@ -1083,12 +1077,10 @@ static void blk_mq_insert_requests(struct request_queue *q,
 				     bool from_schedule)
 
 {
-	struct blk_mq_hw_ctx *hctx;
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	trace_block_unplug(q, depth, !from_schedule);
 
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
-
 	/*
 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
 	 * offline now
@@ -1222,7 +1214,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 
 	blk_queue_enter_live(q);
 	ctx = blk_mq_get_ctx(q);
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	if (rw_is_sync(bio_op(bio), bio->bi_opf))
 		op_flags |= REQ_SYNC;
@@ -1236,7 +1228,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		trace_block_sleeprq(q, bio, op);
 
 		ctx = blk_mq_get_ctx(q);
-		hctx = q->mq_ops->map_queue(q, ctx->cpu);
+		hctx = blk_mq_map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
 		ctx = alloc_data.ctx;
@@ -1253,8 +1245,7 @@ static int blk_mq_direct_issue_request(struct request *rq, blk_qc_t *cookie)
 {
 	int ret;
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q,
-			rq->mq_ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
 		.list = NULL,
@@ -1458,15 +1449,6 @@ run_queue:
 	return cookie;
 }
 
-/*
- * Default mapping to a software queue, since we use one per CPU.
- */
-struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, const int cpu)
-{
-	return q->queue_hw_ctx[q->mq_map[cpu]];
-}
-EXPORT_SYMBOL(blk_mq_map_queue);
-
 static void blk_mq_free_rq_map(struct blk_mq_tag_set *set,
 		struct blk_mq_tags *tags, unsigned int hctx_idx)
 {
@@ -1800,7 +1782,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		if (!cpu_online(i))
 			continue;
 
-		hctx = q->mq_ops->map_queue(q, i);
+		hctx = blk_mq_map_queue(q, i);
 
 		/*
 		 * Set local node, IFF we have more than one hw queue. If
@@ -1838,7 +1820,7 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 			continue;
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
-		hctx = q->mq_ops->map_queue(q, i);
+		hctx = blk_mq_map_queue(q, i);
 
 		cpumask_set_cpu(i, hctx->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
@@ -2303,7 +2285,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
 		return -EINVAL;
 
-	if (!set->ops->queue_rq || !set->ops->map_queue)
+	if (!set->ops->queue_rq)
 		return -EINVAL;
 
 	if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11..ec774bf 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -52,6 +52,12 @@ extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
 				   const struct cpumask *online_mask);
 extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 
+static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
+		int cpu)
+{
+	return q->queue_hw_ctx[q->mq_map[cpu]];
+}
+
 /*
  * sysfs helpers
  */
diff --git a/block/blk.h b/block/blk.h
index c37492f..74444c4 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -39,14 +39,9 @@ extern struct ida blk_queue_ida;
 static inline struct blk_flush_queue *blk_get_flush_queue(
 		struct request_queue *q, struct blk_mq_ctx *ctx)
 {
-	struct blk_mq_hw_ctx *hctx;
-
-	if (!q->mq_ops)
-		return q->fq;
-
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
-
-	return hctx->fq;
+	if (q->mq_ops)
+		return blk_mq_map_queue(q, ctx->cpu)->fq;
+	return q->fq;
 }
 
 static inline void __blk_get_queue(struct request_queue *q)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c9f2107..cbdb3b1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1703,7 +1703,6 @@ static int loop_init_request(void *data, struct request *rq,
 
 static struct blk_mq_ops loop_mq_ops = {
 	.queue_rq       = loop_queue_rq,
-	.map_queue      = blk_mq_map_queue,
 	.init_request	= loop_init_request,
 };
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 2aca98e..3cc92e9 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3895,7 +3895,6 @@ exit_handler:
 
 static struct blk_mq_ops mtip_mq_ops = {
 	.queue_rq	= mtip_queue_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= mtip_init_cmd,
 	.exit_request	= mtip_free_cmd,
 	.complete	= mtip_softirq_done_fn,
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 75a7f88..7d3b7d6 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -393,7 +393,6 @@ static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 
 static struct blk_mq_ops null_mq_ops = {
 	.queue_rq       = null_queue_rq,
-	.map_queue      = blk_mq_map_queue,
 	.init_hctx	= null_init_hctx,
 	.complete	= null_softirq_done_fn,
 };
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6c6519f..c1f84df 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3621,7 +3621,6 @@ static int rbd_init_request(void *data, struct request *rq,
 
 static struct blk_mq_ops rbd_mq_ops = {
 	.queue_rq	= rbd_queue_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= rbd_init_request,
 };
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 93b1aaa..2dc5c96 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -542,7 +542,6 @@ static int virtblk_init_request(void *data, struct request *rq,
 
 static struct blk_mq_ops virtio_mq_ops = {
 	.queue_rq	= virtio_queue_rq,
-	.map_queue	= blk_mq_map_queue,
 	.complete	= virtblk_request_done,
 	.init_request	= virtblk_init_request,
 };
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 88ef6d4..9908597 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -909,7 +909,6 @@ out_busy:
 
 static struct blk_mq_ops blkfront_mq_ops = {
 	.queue_rq = blkif_queue_rq,
-	.map_queue = blk_mq_map_queue,
 };
 
 static void blkif_set_queue_limits(struct blkfront_info *info)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1ca7463..d1c3645 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -908,7 +908,6 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 static struct blk_mq_ops dm_mq_ops = {
 	.queue_rq = dm_mq_queue_rq,
-	.map_queue = blk_mq_map_queue,
 	.complete = dm_softirq_done,
 	.init_request = dm_mq_init_request,
 };
diff --git a/drivers/mtd/ubi/block.c b/drivers/mtd/ubi/block.c
index ebf46ad..d1e6931 100644
--- a/drivers/mtd/ubi/block.c
+++ b/drivers/mtd/ubi/block.c
@@ -351,7 +351,6 @@ static int ubiblock_init_request(void *data, struct request *req,
 static struct blk_mq_ops ubiblock_mq_ops = {
 	.queue_rq       = ubiblock_queue_rq,
 	.init_request	= ubiblock_init_request,
-	.map_queue      = blk_mq_map_queue,
 };
 
 static DEFINE_IDR(ubiblock_minor_idr);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8dcf5a9..086fd7e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1131,7 +1131,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 static struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_admin_init_hctx,
 	.exit_hctx      = nvme_admin_exit_hctx,
 	.init_request	= nvme_admin_init_request,
@@ -1141,7 +1140,6 @@ static struct blk_mq_ops nvme_mq_admin_ops = {
 static struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_init_hctx,
 	.init_request	= nvme_init_request,
 	.timeout	= nvme_timeout,
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 8d2875b..c7d27ea 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1521,7 +1521,6 @@ static void nvme_rdma_complete_rq(struct request *rq)
 static struct blk_mq_ops nvme_rdma_mq_ops = {
 	.queue_rq	= nvme_rdma_queue_rq,
 	.complete	= nvme_rdma_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_rdma_init_request,
 	.exit_request	= nvme_rdma_exit_request,
 	.reinit_request	= nvme_rdma_reinit_request,
@@ -1533,7 +1532,6 @@ static struct blk_mq_ops nvme_rdma_mq_ops = {
 static struct blk_mq_ops nvme_rdma_admin_mq_ops = {
 	.queue_rq	= nvme_rdma_queue_rq,
 	.complete	= nvme_rdma_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_rdma_init_admin_request,
 	.exit_request	= nvme_rdma_exit_admin_request,
 	.reinit_request	= nvme_rdma_reinit_request,
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 7affd40..62f17c0 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -273,7 +273,6 @@ static int nvme_loop_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 static struct blk_mq_ops nvme_loop_mq_ops = {
 	.queue_rq	= nvme_loop_queue_rq,
 	.complete	= nvme_loop_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_loop_init_request,
 	.init_hctx	= nvme_loop_init_hctx,
 	.timeout	= nvme_loop_timeout,
@@ -282,7 +281,6 @@ static struct blk_mq_ops nvme_loop_mq_ops = {
 static struct blk_mq_ops nvme_loop_admin_mq_ops = {
 	.queue_rq	= nvme_loop_queue_rq,
 	.complete	= nvme_loop_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_loop_init_admin_request,
 	.init_hctx	= nvme_loop_init_admin_hctx,
 	.timeout	= nvme_loop_timeout,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c71344a..2cca9cf 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2077,7 +2077,6 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 }
 
 static struct blk_mq_ops scsi_mq_ops = {
-	.map_queue	= blk_mq_map_queue,
 	.queue_rq	= scsi_queue_rq,
 	.complete	= scsi_softirq_done,
 	.timeout	= scsi_timeout,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index a572227..d4d8bc8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -89,7 +89,6 @@ struct blk_mq_queue_data {
 };
 
 typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *);
-typedef struct blk_mq_hw_ctx *(map_queue_fn)(struct request_queue *, const int);
 typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
 typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
 typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
@@ -112,11 +111,6 @@ struct blk_mq_ops {
 	queue_rq_fn		*queue_rq;
 
 	/*
-	 * Map to specific hardware queue
-	 */
-	map_queue_fn		*map_queue;
-
-	/*
 	 * Called on request timeout
 	 */
 	timeout_fn		*timeout;
@@ -221,7 +215,6 @@ static inline u16 blk_mq_unique_tag_to_tag(u32 unique_tag)
 	return unique_tag & BLK_MQ_UNIQUE_TAG_MASK;
 }
 
-struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *, const int ctx_index);
 struct blk_mq_hw_ctx *blk_mq_alloc_single_hw_queue(struct blk_mq_tag_set *, unsigned int, int);
 
 int blk_mq_request_started(struct request *rq);
-- 
2.1.4


* [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-29 10:53   ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

Allow drivers to pass in the affinity mask from the generic interrupt
layer, and spread queues based on that.  If the driver doesn't pass in
a mask, we create one using the genirq helper.  As this helper was
modelled after the blk-mq algorithm, there should be no change in
behavior.
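
As a stand-alone illustration of the spreading rule implemented by the new
blk_mq_create_mq_map() below - the CPU count and mask are made-up example
values, and this is ordinary userspace C rather than kernel code:

	#include <stdio.h>
	#include <stdbool.h>

	int main(void)
	{
		/* pretend 8 CPUs are online and the irq vectors target 0, 2, 4, 6 */
		const bool in_affinity_mask[8] = { 1, 0, 1, 0, 1, 0, 1, 0 };
		unsigned int mq_map[8] = { 0 };	/* kzalloc'ed in the real code */
		int queue = -1;

		for (int cpu = 0; cpu < 8; cpu++) {
			if (in_affinity_mask[cpu])
				queue++;
			if (queue >= 0)
				mq_map[cpu] = queue;
		}

		/* prints 0 0 1 1 2 2 3 3: each vector's CPU plus its followers */
		for (int cpu = 0; cpu < 8; cpu++)
			printf("cpu %d -> hw queue %u\n", cpu, mq_map[cpu]);
		return 0;
	}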

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/Makefile         |   2 +-
 block/blk-mq-cpumap.c  | 120 -------------------------------------------------
 block/blk-mq.c         |  68 +++++++++++++++++++++++++---
 block/blk-mq.h         |   7 +--
 include/linux/blk-mq.h |   1 +
 5 files changed, 66 insertions(+), 132 deletions(-)
 delete mode 100644 block/blk-mq-cpumap.c

diff --git a/block/Makefile b/block/Makefile
index 9eda232..aeb318d 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -6,7 +6,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-tag.o blk-sysfs.o \
 			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-lib.o blk-mq.o blk-mq-tag.o \
-			blk-mq-sysfs.o blk-mq-cpu.o blk-mq-cpumap.o ioctl.o \
+			blk-mq-sysfs.o blk-mq-cpu.o ioctl.o \
 			genhd.o scsi_ioctl.o partition-generic.o ioprio.o \
 			badblocks.o partitions/
 
diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
deleted file mode 100644
index d0634bc..0000000
--- a/block/blk-mq-cpumap.c
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * CPU <-> hardware queue mapping helpers
- *
- * Copyright (C) 2013-2014 Jens Axboe
- */
-#include <linux/kernel.h>
-#include <linux/threads.h>
-#include <linux/module.h>
-#include <linux/mm.h>
-#include <linux/smp.h>
-#include <linux/cpu.h>
-
-#include <linux/blk-mq.h>
-#include "blk.h"
-#include "blk-mq.h"
-
-static int cpu_to_queue_index(unsigned int nr_cpus, unsigned int nr_queues,
-			      const int cpu)
-{
-	return cpu * nr_queues / nr_cpus;
-}
-
-static int get_first_sibling(unsigned int cpu)
-{
-	unsigned int ret;
-
-	ret = cpumask_first(topology_sibling_cpumask(cpu));
-	if (ret < nr_cpu_ids)
-		return ret;
-
-	return cpu;
-}
-
-int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-			    const struct cpumask *online_mask)
-{
-	unsigned int i, nr_cpus, nr_uniq_cpus, queue, first_sibling;
-	cpumask_var_t cpus;
-
-	if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
-		return 1;
-
-	cpumask_clear(cpus);
-	nr_cpus = nr_uniq_cpus = 0;
-	for_each_cpu(i, online_mask) {
-		nr_cpus++;
-		first_sibling = get_first_sibling(i);
-		if (!cpumask_test_cpu(first_sibling, cpus))
-			nr_uniq_cpus++;
-		cpumask_set_cpu(i, cpus);
-	}
-
-	queue = 0;
-	for_each_possible_cpu(i) {
-		if (!cpumask_test_cpu(i, online_mask)) {
-			map[i] = 0;
-			continue;
-		}
-
-		/*
-		 * Easy case - we have equal or more hardware queues. Or
-		 * there are no thread siblings to take into account. Do
-		 * 1:1 if enough, or sequential mapping if less.
-		 */
-		if (nr_queues >= nr_cpus || nr_cpus == nr_uniq_cpus) {
-			map[i] = cpu_to_queue_index(nr_cpus, nr_queues, queue);
-			queue++;
-			continue;
-		}
-
-		/*
-		 * Less then nr_cpus queues, and we have some number of
-		 * threads per cores. Map sibling threads to the same
-		 * queue.
-		 */
-		first_sibling = get_first_sibling(i);
-		if (first_sibling == i) {
-			map[i] = cpu_to_queue_index(nr_uniq_cpus, nr_queues,
-							queue);
-			queue++;
-		} else
-			map[i] = map[first_sibling];
-	}
-
-	free_cpumask_var(cpus);
-	return 0;
-}
-
-unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
-{
-	unsigned int *map;
-
-	/* If cpus are offline, map them to first hctx */
-	map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL,
-				set->numa_node);
-	if (!map)
-		return NULL;
-
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
-		return map;
-
-	kfree(map);
-	return NULL;
-}
-
-/*
- * We have no quick way of doing reverse lookups. This is only used at
- * queue init time, so runtime isn't important.
- */
-int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
-{
-	int i;
-
-	for_each_possible_cpu(i) {
-		if (index == mq_map[i])
-			return local_memory_node(cpu_to_node(i));
-	}
-
-	return NUMA_NO_NODE;
-}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 434df39..a679562 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -22,6 +22,7 @@
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
 #include <linux/crash_dump.h>
+#include <linux/interrupt.h>
 
 #include <trace/events/block.h>
 
@@ -1969,6 +1970,22 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_init_queue);
 
+/*
+ * We have no quick way of doing reverse lookups. This is only used at
+ * queue init time, so runtime isn't important.
+ */
+static int blk_mq_hw_queue_to_node(unsigned int *mq_map, unsigned int index)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		if (index == mq_map[i])
+			return local_memory_node(cpu_to_node(i));
+	}
+
+	return NUMA_NO_NODE;
+}
+
 static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 						struct request_queue *q)
 {
@@ -2268,6 +2285,30 @@ struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags)
 }
 EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
 
+static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
+		const struct cpumask *affinity_mask)
+{
+	int queue = -1, cpu = 0;
+
+	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
+			GFP_KERNEL, set->numa_node);
+	if (!set->mq_map)
+		return -ENOMEM;
+
+	if (!affinity_mask)
+		return 0;	/* map all cpus to queue 0 */
+
+	/* If cpus are offline, map them to first hctx */
+	for_each_online_cpu(cpu) {
+		if (cpumask_test_cpu(cpu, affinity_mask))
+			queue++;
+		if (queue >= 0)
+			set->mq_map[cpu] = queue;
+	}
+
+	return 0;
+}
+
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
@@ -2276,6 +2317,8 @@ EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
+	int ret;
+
 	BUILD_BUG_ON(BLK_MQ_MAX_DEPTH > 1 << BLK_MQ_UNIQUE_TAG_BITS);
 
 	if (!set->nr_hw_queues)
@@ -2314,11 +2357,26 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->tags)
 		return -ENOMEM;
 
-	set->mq_map = blk_mq_make_queue_map(set);
-	if (!set->mq_map)
-		goto out_free_tags;
+	/*
+	 * Use the passed in affinity mask if the driver provided one.
+	 */
+	if (set->affinity_mask) {
+		ret = blk_mq_create_mq_map(set, set->affinity_mask);
+		if (!set->mq_map)
+			goto out_free_tags;
+	} else {
+		struct cpumask *affinity_mask;
+
+		affinity_mask = irq_create_affinity_mask(&set->nr_hw_queues);
+		ret = blk_mq_create_mq_map(set, affinity_mask);
+		kfree(affinity_mask);
+
+		if (!set->mq_map)
+			goto out_free_tags;
+	}
 
-	if (blk_mq_alloc_rq_maps(set))
+	ret = blk_mq_alloc_rq_maps(set);
+	if (ret)
 		goto out_free_mq_map;
 
 	mutex_init(&set->tag_list_lock);
@@ -2332,7 +2390,7 @@ out_free_mq_map:
 out_free_tags:
 	kfree(set->tags);
 	set->tags = NULL;
-	return -ENOMEM;
+	return ret;
 }
 EXPORT_SYMBOL(blk_mq_alloc_tag_set);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index ec774bf..7ef5302 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -45,13 +45,8 @@ void blk_mq_enable_hotplug(void);
 void blk_mq_disable_hotplug(void);
 
 /*
- * CPU -> queue mappings
+ * CPU -> queue mapping
  */
-extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
-extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
-				   const struct cpumask *online_mask);
-extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
-
 static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 		int cpu)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d4d8bc8..29e227b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -75,6 +75,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;		/* BLK_MQ_F_* */
 	void			*driver_data;
+	struct cpumask		*affinity_mask;
 
 	struct blk_mq_tags	**tags;
 
-- 
2.1.4


diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d4d8bc8..29e227b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -75,6 +75,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;		/* BLK_MQ_F_* */
 	void			*driver_data;
+	struct cpumask		*affinity_mask;
 
 	struct blk_mq_tags	**tags;
 
-- 
2.1.4
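
A minimal sketch of how a driver is expected to feed the new affinity_mask
field; the "foo" names are hypothetical, the real example is the NVMe
conversion later in this series:

static int foo_init_tagset(struct foo_dev *foo)
{
        struct blk_mq_tag_set *set = &foo->tagset;

        set->ops = &foo_mq_ops;
        set->nr_hw_queues = foo->nr_queues;
        set->queue_depth = 64;
        set->numa_node = dev_to_node(&foo->pdev->dev);
        set->cmd_size = sizeof(struct foo_cmd);
        set->driver_data = foo;
        /*
         * Hand blk-mq the mask that pci_alloc_irq_vectors(...,
         * PCI_IRQ_AFFINITY) stored in the pci_dev, so the hardware
         * queues line up with the interrupt vectors.
         */
        set->affinity_mask = foo->pdev->irq_affinity;

        return blk_mq_alloc_tag_set(set);
}

Drivers that don't set the field keep the old behaviour: blk-mq calls
irq_create_affinity_mask() itself and spreads the queues on its own.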

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 5/7] nvme: switch to use pci_alloc_irq_vectors
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-29 10:53   ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

Use the new helper to automatically select the right interrupt type, as
well as to use the automatic interrupt affinity assignment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/pci.c | 99 ++++++++++++++-----------------------------------
 1 file changed, 28 insertions(+), 71 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 086fd7e..50e9038 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -88,7 +88,6 @@ struct nvme_dev {
 	unsigned max_qid;
 	int q_depth;
 	u32 db_stride;
-	struct msix_entry *entry;
 	void __iomem *bar;
 	struct work_struct reset_work;
 	struct work_struct remove_work;
@@ -201,6 +200,11 @@ static unsigned int nvme_cmd_size(struct nvme_dev *dev)
 		nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
 }
 
+static int nvmeq_irq(struct nvme_queue *nvmeq)
+{
+	return pci_irq_vector(to_pci_dev(nvmeq->dev->dev), nvmeq->cq_vector);
+}
+
 static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 				unsigned int hctx_idx)
 {
@@ -960,7 +964,7 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 		spin_unlock_irq(&nvmeq->q_lock);
 		return 1;
 	}
-	vector = nvmeq->dev->entry[nvmeq->cq_vector].vector;
+	vector = nvmeq_irq(nvmeq);
 	nvmeq->dev->online_queues--;
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
@@ -968,7 +972,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
 		blk_mq_stop_hw_queues(nvmeq->dev->ctrl.admin_q);
 
-	irq_set_affinity_hint(vector, NULL);
 	free_irq(vector, nvmeq);
 
 	return 0;
@@ -1075,15 +1078,14 @@ static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
 	return NULL;
 }
 
-static int queue_request_irq(struct nvme_dev *dev, struct nvme_queue *nvmeq,
-							const char *name)
+static int queue_request_irq(struct nvme_queue *nvmeq)
 {
 	if (use_threaded_interrupts)
-		return request_threaded_irq(dev->entry[nvmeq->cq_vector].vector,
-					nvme_irq_check, nvme_irq, IRQF_SHARED,
-					name, nvmeq);
-	return request_irq(dev->entry[nvmeq->cq_vector].vector, nvme_irq,
-				IRQF_SHARED, name, nvmeq);
+		return request_threaded_irq(nvmeq_irq(nvmeq), nvme_irq_check,
+				nvme_irq, IRQF_SHARED, nvmeq->irqname, nvmeq);
+	else
+		return request_irq(nvmeq_irq(nvmeq), nvme_irq, IRQF_SHARED,
+				nvmeq->irqname, nvmeq);
 }
 
 static void nvme_init_queue(struct nvme_queue *nvmeq, u16 qid)
@@ -1114,7 +1116,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	if (result < 0)
 		goto release_cq;
 
-	result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
+	result = queue_request_irq(nvmeq);
 	if (result < 0)
 		goto release_sq;
 
@@ -1232,7 +1234,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 		goto free_nvmeq;
 
 	nvmeq->cq_vector = 0;
-	result = queue_request_irq(dev, nvmeq, nvmeq->irqname);
+	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
 		goto free_nvmeq;
@@ -1380,7 +1382,7 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 {
 	struct nvme_queue *adminq = dev->queues[0];
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
-	int result, i, vecs, nr_io_queues, size;
+	int result, nr_io_queues, size;
 
 	nr_io_queues = num_online_cpus();
 	result = nvme_set_queue_count(&dev->ctrl, &nr_io_queues);
@@ -1415,29 +1417,18 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	}
 
 	/* Deregister the admin queue's interrupt */
-	free_irq(dev->entry[0].vector, adminq);
+	free_irq(pci_irq_vector(pdev, 0), adminq);
 
 	/*
 	 * If we enable msix early due to not intx, disable it again before
 	 * setting up the full range we need.
 	 */
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
-
-	for (i = 0; i < nr_io_queues; i++)
-		dev->entry[i].entry = i;
-	vecs = pci_enable_msix_range(pdev, dev->entry, 1, nr_io_queues);
-	if (vecs < 0) {
-		vecs = pci_enable_msi_range(pdev, 1, min(nr_io_queues, 32));
-		if (vecs < 0) {
-			vecs = 1;
-		} else {
-			for (i = 0; i < vecs; i++)
-				dev->entry[i].vector = i + pdev->irq;
-		}
-	}
+	pci_free_irq_vectors(pdev);
+	nr_io_queues = pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
+			PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
+	if (nr_io_queues <= 0)
+		return -EIO;
+	dev->max_qid = nr_io_queues;
 
 	/*
 	 * Should investigate if there's a performance win from allocating
@@ -1445,10 +1436,8 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	 * path to scale better, even if the receive path is limited by the
 	 * number of interrupts.
 	 */
-	nr_io_queues = vecs;
-	dev->max_qid = nr_io_queues;
 
-	result = queue_request_irq(dev, adminq, adminq->irqname);
+	result = queue_request_irq(adminq);
 	if (result) {
 		adminq->cq_vector = -1;
 		goto free_queues;
@@ -1460,23 +1449,6 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
 	return result;
 }
 
-static void nvme_pci_post_scan(struct nvme_ctrl *ctrl)
-{
-	struct nvme_dev *dev = to_nvme_dev(ctrl);
-	struct nvme_queue *nvmeq;
-	int i;
-
-	for (i = 0; i < dev->online_queues; i++) {
-		nvmeq = dev->queues[i];
-
-		if (!nvmeq->tags || !(*nvmeq->tags))
-			continue;
-
-		irq_set_affinity_hint(dev->entry[nvmeq->cq_vector].vector,
-					blk_mq_tags_cpumask(*nvmeq->tags));
-	}
-}
-
 static void nvme_del_queue_end(struct request *req, int error)
 {
 	struct nvme_queue *nvmeq = req->end_io_data;
@@ -1574,6 +1546,7 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		dev->tagset.cmd_size = nvme_cmd_size(dev);
 		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
 		dev->tagset.driver_data = dev;
+		dev->tagset.affinity_mask = to_pci_dev(dev->dev)->irq_affinity;
 
 		if (blk_mq_alloc_tag_set(&dev->tagset))
 			return 0;
@@ -1613,15 +1586,9 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	 * interrupts. Pre-enable a single MSIX or MSI vec for setup. We'll
 	 * adjust this later.
 	 */
-	if (pci_enable_msix(pdev, dev->entry, 1)) {
-		pci_enable_msi(pdev);
-		dev->entry[0].vector = pdev->irq;
-	}
-
-	if (!dev->entry[0].vector) {
-		result = -ENODEV;
-		goto disable;
-	}
+	result = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
+	if (result < 0)
+		return result;
 
 	cap = lo_hi_readq(dev->bar + NVME_REG_CAP);
 
@@ -1663,10 +1630,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pdev->msi_enabled)
-		pci_disable_msi(pdev);
-	else if (pdev->msix_enabled)
-		pci_disable_msix(pdev);
+	pci_free_irq_vectors(pdev);
 
 	if (pci_is_enabled(pdev)) {
 		pci_disable_pcie_error_reporting(pdev);
@@ -1736,7 +1700,6 @@ static void nvme_pci_free_ctrl(struct nvme_ctrl *ctrl)
 	if (dev->ctrl.admin_q)
 		blk_put_queue(dev->ctrl.admin_q);
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 }
 
@@ -1880,7 +1843,6 @@ static const struct nvme_ctrl_ops nvme_pci_ctrl_ops = {
 	.reg_read64		= nvme_pci_reg_read64,
 	.reset_ctrl		= nvme_pci_reset_ctrl,
 	.free_ctrl		= nvme_pci_free_ctrl,
-	.post_scan		= nvme_pci_post_scan,
 	.submit_async_event	= nvme_pci_submit_async_event,
 };
 
@@ -1913,10 +1875,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	dev = kzalloc_node(sizeof(*dev), GFP_KERNEL, node);
 	if (!dev)
 		return -ENOMEM;
-	dev->entry = kzalloc_node(num_possible_cpus() * sizeof(*dev->entry),
-							GFP_KERNEL, node);
-	if (!dev->entry)
-		goto free;
 	dev->queues = kzalloc_node((num_possible_cpus() + 1) * sizeof(void *),
 							GFP_KERNEL, node);
 	if (!dev->queues)
@@ -1957,7 +1915,6 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	nvme_dev_unmap(dev);
  free:
 	kfree(dev->queues);
-	kfree(dev->entry);
 	kfree(dev);
 	return result;
 }
-- 
2.1.4
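
For reference, the probe-time pattern the driver moves to looks roughly
like this in a hypothetical PCI driver (names made up, unwinding kept
minimal):

static irqreturn_t foo_irq(int irq, void *data)
{
        return IRQ_HANDLED;
}

static int foo_setup_irqs(struct pci_dev *pdev, void *priv, unsigned int want)
{
        int nvecs, i, ret;

        /*
         * One call picks MSI-X, MSI or legacy INTx as available, and
         * PCI_IRQ_AFFINITY spreads the vectors over the online CPUs.
         */
        nvecs = pci_alloc_irq_vectors(pdev, 1, want,
                        PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
        if (nvecs < 0)
                return nvecs;

        for (i = 0; i < nvecs; i++) {
                /* pci_irq_vector() replaces the msix_entry bookkeeping */
                ret = request_irq(pci_irq_vector(pdev, i), foo_irq,
                                IRQF_SHARED, "foo", priv);
                if (ret)
                        goto out_free;
        }
        return nvecs;

out_free:
        while (--i >= 0)
                free_irq(pci_irq_vector(pdev, i), priv);
        pci_free_irq_vectors(pdev);
        return ret;
}

The admin queue setup in the patch uses the same helper with
min_vecs = max_vecs = 1 before the I/O queue count is known.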


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 6/7] nvme: remove the post_scan callout
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-29 10:53   ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

No need now that we don't have to reverse engineer the irq affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/core.c | 3 ---
 drivers/nvme/host/nvme.h | 1 -
 2 files changed, 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2feacc7..b245616 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1826,9 +1826,6 @@ static void nvme_scan_work(struct work_struct *work)
 	list_sort(NULL, &ctrl->namespaces, ns_cmp);
 	mutex_unlock(&ctrl->namespaces_mutex);
 	kfree(id);
-
-	if (ctrl->ops->post_scan)
-		ctrl->ops->post_scan(ctrl);
 }
 
 void nvme_queue_scan(struct nvme_ctrl *ctrl)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ab18b78..99e4c16 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -184,7 +184,6 @@ struct nvme_ctrl_ops {
 	int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
 	int (*reset_ctrl)(struct nvme_ctrl *ctrl);
 	void (*free_ctrl)(struct nvme_ctrl *ctrl);
-	void (*post_scan)(struct nvme_ctrl *ctrl);
 	void (*submit_async_event)(struct nvme_ctrl *ctrl, int aer_idx);
 	int (*delete_ctrl)(struct nvme_ctrl *ctrl);
 	const char *(*get_subsysnqn)(struct nvme_ctrl *ctrl);
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 7/7] blk-mq: get rid of the cpumask in struct blk_mq_tags
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-29 10:53   ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-08-29 10:53 UTC (permalink / raw)
  To: axboe; +Cc: keith.busch, linux-block, linux-nvme

Unused now that NVMe sets up irq affinity before calling into blk-mq.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-tag.c     | 6 ------
 block/blk-mq-tag.h     | 1 -
 block/blk-mq.c         | 7 -------
 include/linux/blk-mq.h | 1 -
 4 files changed, 15 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1602813..2eae3d5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -665,11 +665,6 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	if (!tags)
 		return NULL;
 
-	if (!zalloc_cpumask_var(&tags->cpumask, GFP_KERNEL)) {
-		kfree(tags);
-		return NULL;
-	}
-
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
@@ -680,7 +675,6 @@ void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
 	bt_free(&tags->bitmap_tags);
 	bt_free(&tags->breserved_tags);
-	free_cpumask_var(tags->cpumask);
 	kfree(tags);
 }
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d468a79..5569641 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -44,7 +44,6 @@ struct blk_mq_tags {
 	struct list_head page_list;
 
 	int alloc_policy;
-	cpumask_var_t cpumask;
 };
 
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a679562..1fc42a0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1852,7 +1852,6 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 		hctx->tags = set->tags[i];
 		WARN_ON(!hctx->tags);
 
-		cpumask_copy(hctx->tags->cpumask, hctx->cpumask);
 		/*
 		 * Set the map size to the number of mapped software queues.
 		 * This is more accurate and more efficient than looping
@@ -2279,12 +2278,6 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	return 0;
 }
 
-struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags)
-{
-	return tags->cpumask;
-}
-EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
-
 static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
 		const struct cpumask *affinity_mask)
 {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 29e227b..79a421e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -197,7 +197,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int op,
 		unsigned int flags, unsigned int hctx_idx);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
-struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags);
 
 enum {
 	BLK_MQ_UNIQUE_TAG_BITS = 16,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: blk-mq: allow passing in an external queue mapping V2
  2016-08-29 10:53 ` Christoph Hellwig
@ 2016-08-30 23:28   ` Keith Busch
  -1 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2016-08-30 23:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block, linux-nvme

On Mon, Aug 29, 2016 at 12:53:26PM +0200, Christoph Hellwig wrote:
> This series is one third of the earlier "automatic interrupt affinity for
> MSI/MSI-X capable devices" series, and make uses of the new irq-level
> interrupt / queue mapping code in blk-mq, as well as allowing the driver
> to pass in such a mask obtained from the (PCI) interrupt code.  To fully
> support this feature in drivers the final third in the PCI layer will
> be needed as well.
> 
> Note that these patches are on top of Linux 4.8-rc4 and will need several
> patches not yet in the block for-4.9 branch.
> 
> A git tree is available at:
> 
>    git://git.infradead.org/users/hch/block.git block-queue-mapping
> 
> Gitweb:
> 
>    http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/block-queue-mapping
> 
> Changes since V1:
>  - rebased on top of Linux 4.8-rc4
> 
> Changes since automatic interrupt affinity for MSI/MSI-X capable devices V3:
>  - a trivial cleanup in blk_mq_create_mq_map pointed out by Alexander
> 

I really like how this is looking, but the pairing isn't coming out
right when I applied this on one of my test machines. I have 32 CPUs
and 31 MSI-x vectors, and this is the resulting blk-mq cpu list:

  # cat /sys/block/nvme0n1/mq/*/cpu_list
  0
  10
  11
  12
  13
  14
  15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
  1
  2
  3
  4
  5
  6
  7
  8
  9

Before, each mq hardware queue would have 2 CPUs, now it's terribly
unbalanced.

I'll look again tomorrow to make sure I didn't mess something up when
merging your tree, but just wanted to let you know now.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-08-29 10:53   ` Christoph Hellwig
@ 2016-08-31 16:38     ` Keith Busch
  -1 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2016-08-31 16:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block, linux-nvme

On Mon, Aug 29, 2016 at 12:53:30PM +0200, Christoph Hellwig wrote:
> +static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
> +		const struct cpumask *affinity_mask)
> +{
> +	int queue = -1, cpu = 0;
> +
> +	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> +			GFP_KERNEL, set->numa_node);
> +	if (!set->mq_map)
> +		return -ENOMEM;
> +
> +	if (!affinity_mask)
> +		return 0;	/* map all cpus to queue 0 */
> +
> +	/* If cpus are offline, map them to first hctx */
> +	for_each_online_cpu(cpu) {
> +		if (cpumask_test_cpu(cpu, affinity_mask))
> +			queue++;
> +		if (queue >= 0)
> +			set->mq_map[cpu] = queue;
> +	}

This can't be right. We have a single affinity mask for the entire
set, but what I think we want is one affinity mask for each
nr_io_queues. The irq_create_affinity_mask should then create an array
of cpumasks based on nr_vecs..

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: blk-mq: allow passing in an external queue mapping V2
  2016-08-30 23:28   ` Keith Busch
@ 2016-09-01  8:45     ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-01  8:45 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-block, Christoph Hellwig, linux-nvme

On Tue, Aug 30, 2016 at 07:28:24PM -0400, Keith Busch wrote:
> Before, each mq hardware queue would have 2 CPUs, now it's terribly
> unbalanced.

Hmm.  This wasn't supposed to change the cpu mapping, and indeed I saw
results like that horrible one before as well.  Let me dig into this
with my fake lower queue count device.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-08-31 16:38     ` Keith Busch
@ 2016-09-01  8:46       ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-01  8:46 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, axboe, linux-block, linux-nvme

On Wed, Aug 31, 2016 at 12:38:53PM -0400, Keith Busch wrote:
> This can't be right. We have a single affinity mask for the entire
> set, but what I think we want is an one affinity mask for each
> nr_io_queues. The irq_create_affinity_mask should then create an array
> of cpumasks based on nr_vecs..

Nah, this is Thomas' creative abuse of the cpumask type.  Every bit set
in the affinity_mask means this is a cpu we allocate a vector / queue to.
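
To make that concrete, here is a stand-alone user-space model of what
blk_mq_create_mq_map() does with such a mask, assuming a made-up 8-CPU
box where the irq code set bits for CPUs 0, 2, 4 and 6:

#include <stdio.h>

int main(void)
{
        int bit_set[8] = { 1, 0, 1, 0, 1, 0, 1, 0 };    /* the affinity_mask */
        int mq_map[8], cpu, queue = -1;

        for (cpu = 0; cpu < 8; cpu++) {
                if (bit_set[cpu])       /* a set bit starts a new queue */
                        queue++;
                mq_map[cpu] = queue < 0 ? 0 : queue;
                printf("cpu %d -> queue %d\n", cpu, mq_map[cpu]);
        }
        return 0;       /* queues come out as 0 0 1 1 2 2 3 3 */
}

A CPU whose bit is clear simply shares the queue started by the most
recent set bit before it.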

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-09-01  8:46       ` Christoph Hellwig
@ 2016-09-01 14:24         ` Keith Busch
  -1 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2016-09-01 14:24 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block, linux-nvme

On Thu, Sep 01, 2016 at 10:46:24AM +0200, Christoph Hellwig wrote:
> On Wed, Aug 31, 2016 at 12:38:53PM -0400, Keith Busch wrote:
> > This can't be right. We have a single affinity mask for the entire
> > set, but what I think we want is an one affinity mask for each
> > nr_io_queues. The irq_create_affinity_mask should then create an array
> > of cpumasks based on nr_vecs..
> 
> Nah, this is Thomas' creating abuse of the cpumask type.  Every bit set
> in the affinity_mask means this is a cpu we allocate a vector / queue to.

Yeah, I gathered that's what it was providing, but that's just barely
not enough information to do something useful. The CPUs that aren't set
have to use a previously assigned vector/queue, but which one?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-09-01 14:24         ` Keith Busch
@ 2016-09-01 23:30           ` Keith Busch
  -1 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2016-09-01 23:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block, linux-nvme

On Thu, Sep 01, 2016 at 10:24:10AM -0400, Keith Busch wrote:
> Yeah, I gathered that's what it was providing, but that's just barely
> not enough information to do something useful. The CPUs that aren't set
> have to use a previously assigned vector/queue, but which one?

Unless I'm totally missing how to infer paired CPUs, I think we need
arrays.

Here's a stab at that. I'm using the "old" algorithm the NVMe driver used
to pair vectors and cpus. It's not the most efficient way of pairing
that I know of, but it is easy to follow (relatively speaking), and it
actually utilizes every hardware resource available so I get very good
CPU <-> Queue mappings.

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9cc08c6..c5c038e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2283,7 +2283,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
 		const struct cpumask *affinity_mask)
 {
-	int queue = -1, cpu = 0;
+	int queue;
 
 	set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
 			GFP_KERNEL, set->numa_node);
@@ -2293,11 +2293,10 @@ static int blk_mq_create_mq_map(struct blk_mq_tag_set *set,
 	if (!affinity_mask)
 		return 0;	/* map all cpus to queue 0 */
 
-	/* If cpus are offline, map them to first hctx */
-	for_each_online_cpu(cpu) {
-		if (cpumask_test_cpu(cpu, affinity_mask))
-			queue++;
-		if (queue >= 0)
+	for (queue = 0; queue < set->nr_hw_queues; queue++) {
+		int cpu;
+
+		for_each_cpu(cpu, &affinity_mask[queue])
 			set->mq_map[cpu] = queue;
 	}
 
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 98f1222..03a1ffc 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -683,15 +683,11 @@ static int msix_setup_entries(struct pci_dev *dev, void __iomem *base,
 {
 	const struct cpumask *mask = NULL;
 	struct msi_desc *entry;
-	int cpu = -1, i;
+	int i;
 
 	for (i = 0; i < nvec; i++) {
-		if (dev->irq_affinity) {
-			cpu = cpumask_next(cpu, dev->irq_affinity);
-			if (cpu >= nr_cpu_ids)
-				cpu = cpumask_first(dev->irq_affinity);
-			mask = cpumask_of(cpu);
-		}
+		if (dev->irq_affinity)
+			mask = &dev->irq_affinity[i];
 
 		entry = alloc_msi_entry(&dev->dev);
 		if (!entry) {
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 32f6cfc..9fe548b 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -4,14 +4,47 @@
 #include <linux/slab.h>
 #include <linux/cpu.h>
 
-static int get_first_sibling(unsigned int cpu)
+static int find_closest_node(int node)
 {
-	unsigned int ret;
+	int n, val, min_val = INT_MAX, best_node = node;
+
+	for_each_online_node(n) {
+		if (n == node)
+			continue;
+		val = node_distance(node, n);
+		if (val < min_val) {
+			min_val = val;
+			best_node = n;
+		}
+	}
+	return best_node;
+}
+
+static void set_vec_cpus(const cpumask_t *qmask, struct cpumask *affinity_mask,
+								int count)
+{
+	int cpu;
+
+	for_each_cpu(cpu, qmask) {
+		if (cpumask_weight(affinity_mask) >= count)
+			break;
+		cpumask_set_cpu(cpu, affinity_mask);
+	}
+}
+
+static void add_cpus(cpumask_t *mask, const cpumask_t *unassigned_cpus,
+	const cpumask_t *new_mask, struct cpumask *affinity_mask,
+	int cpus_per_queue)
+{
+	int next_cpu;
+
+	for_each_cpu(next_cpu, new_mask) {
+		cpumask_or(mask, mask, get_cpu_mask(next_cpu));
+		cpumask_or(mask, mask, topology_sibling_cpumask(next_cpu));
+		cpumask_and(mask, mask, unassigned_cpus);
+	}
+	set_vec_cpus(mask, affinity_mask, cpus_per_queue);
 
-	ret = cpumask_first(topology_sibling_cpumask(cpu));
-	if (ret < nr_cpu_ids)
-		return ret;
-	return cpu;
 }
 
 /*
@@ -27,37 +60,76 @@ static int get_first_sibling(unsigned int cpu)
  */
 struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
 {
-	struct cpumask *affinity_mask;
-	unsigned int max_vecs = *nr_vecs;
+	struct cpumask *affinity_mask, *masks;
+	unsigned int max_vecs = *nr_vecs, cpu, cpus_per_vec, remainder, i;
+	cpumask_var_t unassigned_cpus;
 
 	if (max_vecs == 1)
 		return NULL;
 
-	affinity_mask = kzalloc(cpumask_size(), GFP_KERNEL);
-	if (!affinity_mask) {
+	masks = kcalloc(max_vecs, sizeof(*affinity_mask), GFP_KERNEL);
+	if (!masks) {
 		*nr_vecs = 1;
 		return NULL;
 	}
 
 	get_online_cpus();
-	if (max_vecs >= num_online_cpus()) {
-		cpumask_copy(affinity_mask, cpu_online_mask);
-		*nr_vecs = num_online_cpus();
-	} else {
-		unsigned int vecs = 0, cpu;
-
-		for_each_online_cpu(cpu) {
-			if (cpu == get_first_sibling(cpu)) {
-				cpumask_set_cpu(cpu, affinity_mask);
-				vecs++;
-			}
-
-			if (--max_vecs == 0)
-				break;
-		}
-		*nr_vecs = vecs;
+
+	cpus_per_vec = num_online_cpus() / max_vecs;
+	remainder = max_vecs - (num_online_cpus() - max_vecs * cpus_per_vec);
+
+	cpumask_copy(unassigned_cpus, cpu_online_mask);
+	cpu = cpumask_first(unassigned_cpus);
+
+	for (i = 0; i < max_vecs; i++) {
+		cpumask_t mask;
+
+		if (!cpumask_weight(unassigned_cpus))
+			break;
+
+		affinity_mask = &masks[i];
+
+		mask = *get_cpu_mask(cpu);
+		set_vec_cpus(&mask, affinity_mask, cpus_per_vec);
+
+		if (cpumask_weight(&mask) < cpus_per_vec)
+			add_cpus(&mask, unassigned_cpus,
+				topology_sibling_cpumask(cpu),
+				affinity_mask, cpus_per_vec);
+		if (cpumask_weight(&mask) < cpus_per_vec)
+			add_cpus(&mask, unassigned_cpus,
+				topology_core_cpumask(cpu),
+				affinity_mask, cpus_per_vec);
+		if (cpumask_weight(&mask) < cpus_per_vec)
+			add_cpus(&mask, unassigned_cpus,
+				cpumask_of_node(cpu_to_node(cpu)),
+				affinity_mask, cpus_per_vec);
+		if (cpumask_weight(&mask) < cpus_per_vec)
+			add_cpus(&mask, unassigned_cpus,
+				cpumask_of_node(
+					find_closest_node(
+						cpu_to_node(cpu))),
+				affinity_mask, cpus_per_vec);
+		if (cpumask_weight(&mask) < cpus_per_vec)
+			add_cpus(&mask, unassigned_cpus,
+				unassigned_cpus, affinity_mask,
+				cpus_per_vec);
+
+		cpumask_andnot(unassigned_cpus, unassigned_cpus, affinity_mask);
+		cpu = cpumask_next(cpu, unassigned_cpus);
+
+		if (remainder && !--remainder)
+			cpus_per_vec++;
 	}
 	put_online_cpus();
 
-	return affinity_mask;
+	i = 0;
+	cpumask_andnot(unassigned_cpus, cpu_possible_mask, cpu_online_mask);
+	for_each_cpu(cpu, unassigned_cpus) {
+		set_vec_cpus(get_cpu_mask(cpu), &masks[i], ~0);
+		i = (i + 1) % max_vecs;
+	}
+	free_cpumask_var(unassigned_cpus);
+
+	return masks;
 }
--
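
To show the shape of the interface change, here is a stand-alone
user-space model of the per-queue walk in the blk-mq hunk above,
assuming a made-up 8-CPU box split over 3 vectors:

#include <stdio.h>

#define N_CPUS          8
#define N_QUEUES        3

int main(void)
{
        /* one bitmask per vector, here CPUs 0-2, 3-5 and 6-7 */
        unsigned int masks[N_QUEUES] = { 0x07, 0x38, 0xc0 };
        int mq_map[N_CPUS] = { 0 };
        int queue, cpu;

        for (queue = 0; queue < N_QUEUES; queue++)
                for (cpu = 0; cpu < N_CPUS; cpu++)
                        if (masks[queue] & (1u << cpu))
                                mq_map[cpu] = queue;

        for (cpu = 0; cpu < N_CPUS; cpu++)
                printf("cpu %d -> queue %d\n", cpu, mq_map[cpu]);
        return 0;       /* 0 0 0 1 1 1 2 2 */
}

With one mask per vector there is no ambiguity about which queue an
unset CPU belongs to, since every CPU appears in exactly one mask.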

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-09-01 14:24         ` Keith Busch
@ 2016-09-05 19:48           ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-05 19:48 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, axboe, linux-block, linux-nvme

On Thu, Sep 01, 2016 at 10:24:10AM -0400, Keith Busch wrote:
> On Thu, Sep 01, 2016 at 10:46:24AM +0200, Christoph Hellwig wrote:
> > On Wed, Aug 31, 2016 at 12:38:53PM -0400, Keith Busch wrote:
> > > This can't be right. We have a single affinity mask for the entire
> > > set, but what I think we want is an one affinity mask for each
> > > nr_io_queues. The irq_create_affinity_mask should then create an array
> > > of cpumasks based on nr_vecs..
> > 
> > Nah, this is Thomas' creating abuse of the cpumask type.  Every bit set
> > in the affinity_mask means this is a cpu we allocate a vector / queue to.
> 
> Yeah, I gathered that's what it was providing, but that's just barely
> not enough information to do something useful. The CPUs that aren't set
> have to use a previously assigned vector/queue, but which one?

Always the previous one.  Below is a patch to get us back to the
previous behavior:

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 32f6cfc..09d4407 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -29,6 +29,8 @@ struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
 {
 	struct cpumask *affinity_mask;
 	unsigned int max_vecs = *nr_vecs;
+	unsigned int nr_cpus = 0, nr_uniq_cpus = 0, cpu;
+	unsigned int vec = 0, prev = -1, idx = 0;
 
 	if (max_vecs == 1)
 		return NULL;
@@ -40,24 +42,27 @@ struct cpumask *irq_create_affinity_mask(unsigned int *nr_vecs)
 	}
 
 	get_online_cpus();
-	if (max_vecs >= num_online_cpus()) {
-		cpumask_copy(affinity_mask, cpu_online_mask);
-		*nr_vecs = num_online_cpus();
-	} else {
-		unsigned int vecs = 0, cpu;
-
-		for_each_online_cpu(cpu) {
-			if (cpu == get_first_sibling(cpu)) {
-				cpumask_set_cpu(cpu, affinity_mask);
-				vecs++;
-			}
-
-			if (--max_vecs == 0)
-				break;
-		}
-		*nr_vecs = vecs;
+	for_each_online_cpu(cpu) {
+		nr_cpus++;
+		if (cpu == get_first_sibling(cpu))
+			nr_uniq_cpus++;
+	}
+
+	for_each_online_cpu(cpu) {
+		if (max_vecs >= nr_cpus || nr_cpus == nr_uniq_cpus)
+			vec = idx * max_vecs / nr_cpus;
+		else if (cpu == get_first_sibling(cpu))
+			vec = idx * max_vecs / nr_uniq_cpus;
+		else
+			continue;
+
+		if (vec != prev)
+			cpumask_set_cpu(cpu, affinity_mask);
+		prev = vec;
+		idx++;
 	}
 	put_online_cpus();
 
+	*nr_vecs = idx;
 	return affinity_mask;
 }

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-09-05 19:48           ` Christoph Hellwig
@ 2016-09-06 14:39             ` Keith Busch
  -1 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2016-09-06 14:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block, linux-nvme

On Mon, Sep 05, 2016 at 09:48:00PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 01, 2016 at 10:24:10AM -0400, Keith Busch wrote:
> > On Thu, Sep 01, 2016 at 10:46:24AM +0200, Christoph Hellwig wrote:
> > > On Wed, Aug 31, 2016 at 12:38:53PM -0400, Keith Busch wrote:
> > > > This can't be right. We have a single affinity mask for the entire
> > > > set, but what I think we want is one affinity mask for each of the
> > > > nr_io_queues. The irq_create_affinity_mask should then create an array
> > > > of cpumasks based on nr_vecs.
> > > 
> > > Nah, this is Thomas' creative abuse of the cpumask type.  Every bit set
> > > in the affinity_mask means this is a cpu we allocate a vector / queue to.
> > 
> > Yeah, I gathered that's what it was providing, but that's just barely
> > not enough information to do something useful. The CPUs that aren't set
> > have to use a previously assigned vector/queue, but which one?
> 
> Always the previous one.  Below is a patch to get us back to the
> previous behavior:

No, that's not right.

Here's my topology info:

  # numactl --hardware
  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
  node 0 size: 15745 MB
  node 0 free: 15319 MB
  node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
  node 1 size: 16150 MB
  node 1 free: 15758 MB
  node distances:
  node   0   1
    0:  10  21
    1:  21  10

If I have 16 vectors, the affinity_mask generated by what you're doing
looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
of those is the first unique CPU, getting a unique vector just like you
wanted. If an unset bit just means share with the previous one, then all
of my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!

What we want for my CPU topology is for the 16th CPU to pair with CPU 0,
17 with 1, 18 with 2, and so on. You can't convey that information
with this scheme. We need affinity_masks per vector.
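
Just to spell the pairing out with a hypothetical sketch, hard-coding my
32-CPU / 16-vector box where cpu and cpu + 16 are thread siblings (the
masks[] array, one cpumask per vector, is only for illustration):

  struct cpumask *masks = kcalloc(16, sizeof(*masks), GFP_KERNEL);
  unsigned int cpu;

  if (masks) {
          for (cpu = 0; cpu < 32; cpu++)
                  /* cpu 16 joins vector 0, cpu 17 joins vector 1, ... */
                  cpumask_set_cpu(cpu, &masks[cpu % 16]);
  }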

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-09-06 14:39             ` Keith Busch
@ 2016-09-06 16:50               ` Christoph Hellwig
  -1 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2016-09-06 16:50 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, linux-block, linux-nvme, Thomas Gleixner

[adding Thomas as it's about the affinity_mask he (we) added to the
 IRQ core]

On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > Always the previous one.  Below is a patch to get us back to the
> > previous behavior:
> 
> No, that's not right.
> 
> Here's my topology info:
> 
>   # numactl --hardware
>   available: 2 nodes (0-1)
>   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
>   node 0 size: 15745 MB
>   node 0 free: 15319 MB
>   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
>   node 1 size: 16150 MB
>   node 1 free: 15758 MB
>   node distances:
>   node   0   1
>     0:  10  21
>     1:  21  10

How do you get that mapping?  Does this CPU use Hyperthreading and
thus expose siblings using topology_sibling_cpumask?  As that's the
only thing the old code used for any sort of special casing.

I'll need to see if I can find a system with such a mapping to reproduce.

> If I have 16 vectors, the affinity_mask generated by what you're doing
> looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
> of those is the first unique CPU, getting a unique vector just like you
> wanted. If an unset bit just means share with the previous one, then all
> of my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> 
> What we want for my CPU topology is for the 16th CPU to pair with CPU 0,
> 17 with 1, 18 with 2, and so on. You can't convey that information
> with this scheme. We need affinity_masks per vector.

We actually have per-vector masks, but they are hidden inside the IRQ
core and awkward to use.  We could do the get_first_sibling magic
in the blk-mq queue mapping (and in fact with the current code I guess
we need to).  Or take a step back from trying to emulate the old code
and look at NUMA nodes instead of siblings, which some folks
suggested a while ago.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-09-06 16:50               ` Christoph Hellwig
@ 2016-09-06 17:30                 ` Keith Busch
  -1 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2016-09-06 17:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: axboe, linux-block, linux-nvme, Thomas Gleixner

On Tue, Sep 06, 2016 at 06:50:56PM +0200, Christoph Hellwig wrote:
> [adding Thomas as it's about the affinity_mask he (we) added to the
>  IRQ core]
> > Here's my topology info:
> > 
> >   # numactl --hardware
> >   available: 2 nodes (0-1)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> >   node 0 size: 15745 MB
> >   node 0 free: 15319 MB
> >   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> >   node 1 size: 16150 MB
> >   node 1 free: 15758 MB
> >   node distances:
> >   node   0   1
> >     0:  10  21
> >     1:  21  10
> 
> How do you get that mapping?  Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask?  As that's the
> only thing the old code used for any sort of special casing.
> 
> I'll need to see if I can find a system with such a mapping to reproduce.

Yes, this is a two-socket server with hyperthreading enabled. Numbering
the physical CPUs before the hyperthreads is common on x86, so we're
going to see this split enumeration on any multi-socket hyperthreaded
server.

The topology_sibling_cpumask shows the right information. The resulting
mask from cpu 0 on my server is 0x00010001; cpu 1 is 0x00020002, etc...
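
For reference, the "first sibling" trick that derives the pairing from
that mask boils down to something like this (a sketch, not a claim about
the exact helper in the tree):

  static unsigned int first_sibling(unsigned int cpu)
  {
          unsigned int first = cpumask_first(topology_sibling_cpumask(cpu));

          /* fall back to the CPU itself if the sibling mask is empty */
          return first < nr_cpu_ids ? first : cpu;
  }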

> > What we want for my CPU topology is for the 16th CPU to pair with CPU 0,
> > 17 with 1, 18 with 2, and so on. You can't convey that information
> > with this scheme. We need affinity_masks per vector.
> 
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use.  We could do the get_first_sibling magic
> in the blk-mq queue mapping (and in fact with the current code I guess
> we need to).  Or take a step back from trying to emulate the old code
> and look at NUMA nodes instead of siblings, which some folks
> suggested a while ago.

Adding the first sibling magic in blk-mq would fix my specific case,
but it doesn't help generically when we need to pair more than just
thread siblings.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask
  2016-09-06 16:50               ` Christoph Hellwig
@ 2016-09-07 15:38                 ` Thomas Gleixner
  -1 siblings, 0 replies; 38+ messages in thread
From: Thomas Gleixner @ 2016-09-07 15:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, axboe, linux-block, linux-nvme

On Tue, 6 Sep 2016, Christoph Hellwig wrote:

> [adding Thomas as it's about the affinity_mask he (we) added to the
>  IRQ core]
> 
> On Tue, Sep 06, 2016 at 10:39:28AM -0400, Keith Busch wrote:
> > > Always the previous one.  Below is a patch to get us back to the
> > > previous behavior:
> > 
> > No, that's not right.
> > 
> > Here's my topology info:
> > 
> >   # numactl --hardware
> >   available: 2 nodes (0-1)
> >   node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
> >   node 0 size: 15745 MB
> >   node 0 free: 15319 MB
> >   node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
> >   node 1 size: 16150 MB
> >   node 1 free: 15758 MB
> >   node distances:
> >   node   0   1
> >     0:  10  21
> >     1:  21  10
> 
> How do you get that mapping?  Does this CPU use Hyperthreading and
> thus expose siblings using topology_sibling_cpumask?  As that's the
> only thing the old code used for any sort of special casing.

That's a normal Intel mapping with two sockets and HT enabled. The cpu
enumeration is

Socket0 - physical cores
Socket1 - physical cores

Socket0 - HT siblings
Socket1 - HT siblings
 
> I'll need to see if I can find a system with such a mapping to reproduce.

Any 2 socket Intel with HT enabled will do. If you need access to one let
me know.

> > If I have 16 vectors, the affinity_mask generated by what you're doing
> > looks like 0000ffff, CPUs 0-15. So the first 16 bits are set since each
> > of those is the first unique CPU, getting a unique vector just like you
> > wanted. If an unset bit just means share with the previous one, then all
> > of my thread siblings (CPUs 16-31) get to share with CPU 15. That's awful!
> > 
> > What we want for my CPU topology is for the 16th CPU to pair with CPU 0,
> > 17 with 1, 18 with 2, and so on. You can't convey that information
> > with this scheme. We need affinity_masks per vector.
> 
> We actually have per-vector masks, but they are hidden inside the IRQ
> core and awkward to use.  We could do the get_first_sibling magic
> in the blk-mq queue mapping (and in fact with the current code I guess
> we need to).  Or take a step back from trying to emulate the old code
> and look at NUMA nodes instead of siblings, which some folks
> suggested a while ago.

I think you want both.

NUMA nodes are certainly the first decision factor. You split the number of
vectors across the nodes:

  vecs_per_node = num_vecs / num_nodes;

Then you spread each node's share of vectors over the number of cpus per
node:

  cpus_per_vec = cpus_on(node) / vecs_per_node;

If the number of cpus per vector is <= 1 you just use a round-robin
scheme. If not, you need to look at siblings.
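
As a rough sketch of that spreading, ignoring rounding, leftover vectors
and the sibling handling; num_vecs and the per-vector masks[] array are
assumed inputs, the rest uses standard kernel helpers:

  static void sketch_spread_vectors(struct cpumask *masks, unsigned int num_vecs)
  {
          unsigned int vecs_per_node = max(1U, num_vecs / num_online_nodes());
          unsigned int vec = 0;
          int node;

          for_each_online_node(node) {
                  const struct cpumask *node_cpus = cpumask_of_node(node);
                  unsigned int cpus_per_vec =
                          max(1U, cpumask_weight(node_cpus) / vecs_per_node);
                  unsigned int cpu, assigned = 0;

                  for_each_cpu(cpu, node_cpus) {
                          /* clamp so rounding can't run past the last vector */
                          cpumask_set_cpu(cpu, &masks[min(vec, num_vecs - 1)]);
                          if (++assigned == cpus_per_vec) {
                                  assigned = 0;
                                  vec++;  /* next vector for this node */
                          }
                  }
          }
  }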

Looking at the whole thing, I think we need to be more clever when setting
up the MSI descriptor affinity masks.

I'll send a RFC series soon.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-09-07 15:41 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-29 10:53 blk-mq: allow passing in an external queue mapping V2 Christoph Hellwig
2016-08-29 10:53 ` Christoph Hellwig
2016-08-29 10:53 ` [PATCH 1/7] blk-mq: don't redistribute hardware queues on a CPU hotplug event Christoph Hellwig
2016-08-29 10:53   ` Christoph Hellwig
2016-08-29 10:53 ` [PATCH 2/7] blk-mq: only allocate a single mq_map per tag_set Christoph Hellwig
2016-08-29 10:53   ` Christoph Hellwig
2016-08-29 10:53 ` [PATCH 3/7] blk-mq: remove ->map_queue Christoph Hellwig
2016-08-29 10:53   ` Christoph Hellwig
2016-08-29 10:53 ` [PATCH 4/7] blk-mq: allow the driver to pass in an affinity mask Christoph Hellwig
2016-08-29 10:53   ` Christoph Hellwig
2016-08-31 16:38   ` Keith Busch
2016-08-31 16:38     ` Keith Busch
2016-09-01  8:46     ` Christoph Hellwig
2016-09-01  8:46       ` Christoph Hellwig
2016-09-01 14:24       ` Keith Busch
2016-09-01 14:24         ` Keith Busch
2016-09-01 23:30         ` Keith Busch
2016-09-01 23:30           ` Keith Busch
2016-09-05 19:48         ` Christoph Hellwig
2016-09-05 19:48           ` Christoph Hellwig
2016-09-06 14:39           ` Keith Busch
2016-09-06 14:39             ` Keith Busch
2016-09-06 16:50             ` Christoph Hellwig
2016-09-06 16:50               ` Christoph Hellwig
2016-09-06 17:30               ` Keith Busch
2016-09-06 17:30                 ` Keith Busch
2016-09-07 15:38               ` Thomas Gleixner
2016-09-07 15:38                 ` Thomas Gleixner
2016-08-29 10:53 ` [PATCH 5/7] nvme: switch to use pci_alloc_irq_vectors Christoph Hellwig
2016-08-29 10:53   ` Christoph Hellwig
2016-08-29 10:53 ` [PATCH 6/7] nvme: remove the post_scan callout Christoph Hellwig
2016-08-29 10:53   ` Christoph Hellwig
2016-08-29 10:53 ` [PATCH 7/7] blk-mq: get rid of the cpumask in struct blk_mq_tags Christoph Hellwig
2016-08-29 10:53   ` Christoph Hellwig
2016-08-30 23:28 ` blk-mq: allow passing in an external queue mapping V2 Keith Busch
2016-08-30 23:28   ` Keith Busch
2016-09-01  8:45   ` Christoph Hellwig
2016-09-01  8:45     ` Christoph Hellwig
