* [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

The Linux block device layer limits the number of hardware context queues
to the number of CPUs in the system. That leads to suboptimal hardware
utilization on systems where the number of CPUs is (significantly) smaller
than the number of hardware queues.

In addition, there is a need to deal with tag starvation (see commit
0d2602ca "blk-mq: improve support for shared tags maps"): while unused
hardware queues stay idle, extra effort is spent maintaining a notion
of fairness between queue users. A deeper queue depth could probably
mitigate the issue in some cases.

All of this suggests a straightforward idea: the hardware queues provided
by a device should be utilized as fully as possible.

This series is an attempt to introduce a 1:N mapping between CPUs and
hardware queues. The code is experimental, so some checks and sysfs
interfaces are withdrawn where they would block the demo implementation.

The implementation evenly distributes hardware queues across CPUs, with
moderate changes to the existing codebase. Further developments of the
design are possible if needed, e.g. complete device utilization, CPU-
and/or interrupt-topology-driven queue distribution, and workload-driven
queue redistribution (see the sketch below for the basic idea).
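
For illustration only -- the helper below is not part of this series, and
its name and layout are hypothetical -- this is roughly how an even 1:N
spread of hardware queues over CPUs could be computed when a device
exposes more queues than there are CPUs:

	/*
	 * Hypothetical sketch: spread the hardware queues over CPUs in
	 * contiguous groups, with the group size rounded up so that every
	 * queue is owned by exactly one CPU.
	 */
	static unsigned int demo_queue_to_cpu(unsigned int queue,
					      unsigned int nr_cpus,
					      unsigned int nr_hw_queues)
	{
		unsigned int per_cpu = DIV_ROUND_UP(nr_hw_queues, nr_cpus);

		return queue / per_cpu;
	}

E.g. with 2 CPUs and 8 hardware queues, CPU 0 would own queues 0-3 and
CPU 1 would own queues 4-7.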

Comments and suggestions are very welcome!

The series is against the linux-block tree.

Thanks!

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org

Alexander Gordeev (21):
  blk-mq: Fix memory leaks on a queue cleanup
  blk-mq: Fix a potential NULL pointer assignment to hctx tags
  block: Get rid of unused request_queue::nr_queues member
  blk-mq: Do not limit number of queues to 'nr_cpu_ids' in allocations
  blk-mq: Update hardware queue map after q->nr_hw_queues is set
  block: Remove redundant blk_mq_ops::map_queue() interface
  blk-mq: Remove a redundant assignment
  blk-mq: Cleanup hardware context data node selection
  blk-mq: Cleanup a loop exit condition
  blk-mq: Get rid of unnecessary blk_mq_free_hw_queues()
  blk-mq: Move duplicating code to blk_mq_exit_hctx()
  blk-mq: Uninit hardware context in order reverse to init
  blk-mq: Move hardware context init code into blk_mq_init_hctx()
  blk-mq: Rework blk_mq_init_hctx() function
  blk-mq: Pair blk_mq_hctx_kobj_init() with blk_mq_hctx_kobj_put()
  blk-mq: Set flush_start_tag to BLK_MQ_MAX_DEPTH
  blk-mq: Introduce 1:N hardware contexts
  blk-mq: Enable tag numbers to exceed hardware queue depth
  blk-mq: Enable combined hardware queues
  blk-mq: Allow combined hardware queues
  null_blk: Do not limit # of hardware queues to # of CPUs

 block/blk-core.c                  |   5 +-
 block/blk-flush.c                 |   6 +-
 block/blk-mq-cpumap.c             |  49 +++--
 block/blk-mq-sysfs.c              |   5 +
 block/blk-mq-tag.c                |   9 +-
 block/blk-mq.c                    | 373 +++++++++++++++-----------------------
 block/blk-mq.h                    |   4 +-
 block/blk.h                       |   2 +-
 drivers/block/loop.c              |   3 +-
 drivers/block/mtip32xx/mtip32xx.c |   4 +-
 drivers/block/null_blk.c          |  16 +-
 drivers/block/rbd.c               |   3 +-
 drivers/block/virtio_blk.c        |   6 +-
 drivers/block/xen-blkfront.c      |   6 +-
 drivers/md/dm-rq.c                |   4 +-
 drivers/mtd/ubi/block.c           |   1 -
 drivers/nvme/host/pci.c           |  29 +--
 drivers/nvme/host/rdma.c          |   2 -
 drivers/nvme/target/loop.c        |   2 -
 drivers/scsi/scsi_lib.c           |   4 +-
 include/linux/blk-mq.h            |  51 ++++--
 include/linux/blkdev.h            |   1 -
 22 files changed, 279 insertions(+), 306 deletions(-)

-- 
1.8.3.1

* [PATCH 01/21] blk-mq: Fix memory leaks on a queue cleanup
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

Some data structures are leaked when the blk_cleanup_queue() interface
is called.

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 13f5a6c..90e3fef 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1707,8 +1707,13 @@ static void blk_mq_free_hw_queues(struct request_queue *q,
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	queue_for_each_hw_ctx(q, hctx, i)
+	queue_for_each_hw_ctx(q, hctx, i) {
 		free_cpumask_var(hctx->cpumask);
+		kfree(hctx->ctxs);
+		kfree(hctx);
+	}
+
+	q->nr_hw_queues = 0;
 }
 
 static int blk_mq_init_hctx(struct request_queue *q,
-- 
1.8.3.1

* [PATCH 02/21] blk-mq: Fix a potential NULL pointer assignment to hctx tags
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

If the number of used hardware queues is dynamically decreased, the tags
corresponding to the newly unused queues are freed.

If those previously unused hardware queues are reused later, they will
start referring to the previously freed tags.

CC: Jens Axboe <axboe@fb.com>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90e3fef..1cacf83 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2005,6 +2005,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		if (hctxs[i])
 			continue;
+		if (!set->tags[i])
+			break;
 
 		node = blk_mq_hw_queue_to_node(q->mq_map, i);
 		hctxs[i] = kzalloc_node(sizeof(struct blk_mq_hw_ctx),
-- 
1.8.3.1

* [PATCH 03/21] block: Get rid of unused request_queue::nr_queues member
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c         | 2 --
 include/linux/blkdev.h | 1 -
 2 files changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1cacf83..b6a7dee 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2080,8 +2080,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
-	q->nr_queues = nr_cpu_ids;
-
 	q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 
 	if (!(set->flags & BLK_MQ_F_SG_MERGE))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e79055c..268f160 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -325,7 +325,6 @@ struct request_queue {
 
 	/* sw queues */
 	struct blk_mq_ctx __percpu	*queue_ctx;
-	unsigned int		nr_queues;
 
 	/* hw dispatch queues */
 	struct blk_mq_hw_ctx	**queue_hw_ctx;
-- 
1.8.3.1

* [PATCH 04/21] blk-mq: Do not limit number of queues to 'nr_cpu_ids' in allocations
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

Currently the maximum number of used hardware queues is limited to the
number of CPUs in the system. However, using 'nr_cpu_ids' as the limit
for (de-)allocations of data structures, instead of the existing data
structures' counters, (a) worsens readability and (b) leaves memory
unused when the number of hardware queues is less than the number of
CPUs.

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b6a7dee..0f0a01a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2064,8 +2064,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_ctx)
 		goto err_exit;
 
-	q->queue_hw_ctx = kzalloc_node(nr_cpu_ids * sizeof(*(q->queue_hw_ctx)),
-						GFP_KERNEL, set->numa_node);
+	q->queue_hw_ctx = kzalloc_node(set->nr_hw_queues *
+			sizeof(*(q->queue_hw_ctx)), GFP_KERNEL, set->numa_node);
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
 
@@ -2339,7 +2339,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (set->nr_hw_queues > nr_cpu_ids)
 		set->nr_hw_queues = nr_cpu_ids;
 
-	set->tags = kzalloc_node(nr_cpu_ids * sizeof(struct blk_mq_tags *),
+	set->tags = kzalloc_node(set->nr_hw_queues * sizeof(*set->tags),
 				 GFP_KERNEL, set->numa_node);
 	if (!set->tags)
 		return -ENOMEM;
@@ -2362,7 +2362,7 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 {
 	int i;
 
-	for (i = 0; i < nr_cpu_ids; i++) {
+	for (i = 0; i < set->nr_hw_queues; i++) {
 		if (set->tags[i])
 			blk_mq_free_rq_map(set, set->tags[i], i);
 	}
-- 
1.8.3.1

* [PATCH 05/21] blk-mq: Update hardware queue map after q->nr_hw_queues is set
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

Initialization of the hardware queue map should be done after the
hardware context allocations, since we might end up with fewer hardware
contexts than requested.

Because the mapping of hardware contexts to CPUs depends on both the
number of CPUs and the number of hardware contexts, the map should be
built only after both numbers are determined.

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq-cpumap.c | 17 -----------------
 block/blk-mq.c        |  8 +++++++-
 block/blk-mq.h        |  1 -
 3 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index d0634bc..ee553a4 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -86,23 +86,6 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
 	return 0;
 }
 
-unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set)
-{
-	unsigned int *map;
-
-	/* If cpus are offline, map them to first hctx */
-	map = kzalloc_node(sizeof(*map) * nr_cpu_ids, GFP_KERNEL,
-				set->numa_node);
-	if (!map)
-		return NULL;
-
-	if (!blk_mq_update_queue_map(map, set->nr_hw_queues, cpu_online_mask))
-		return map;
-
-	kfree(map);
-	return NULL;
-}
-
 /*
  * We have no quick way of doing reverse lookups. This is only used at
  * queue init time, so runtime isn't important.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0f0a01a..401ceea 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2069,7 +2069,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
 
-	q->mq_map = blk_mq_make_queue_map(set);
+	/* If cpus are offline, map them to first hctx */
+	q->mq_map = kzalloc_node(sizeof(*q->mq_map) * nr_cpu_ids, GFP_KERNEL,
+					set->numa_node);
 	if (!q->mq_map)
 		goto err_map;
 
@@ -2077,6 +2079,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->nr_hw_queues)
 		goto err_hctxs;
 
+	if (blk_mq_update_queue_map(q->mq_map, q->nr_hw_queues, cpu_online_mask))
+		goto err_update;
+
 	INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
@@ -2119,6 +2124,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	return q;
 
 err_hctxs:
+err_update:
 	kfree(q->mq_map);
 err_map:
 	kfree(q->queue_hw_ctx);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9087b11..97b0051 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -47,7 +47,6 @@ void blk_mq_disable_hotplug(void);
 /*
  * CPU -> queue mappings
  */
-extern unsigned int *blk_mq_make_queue_map(struct blk_mq_tag_set *set);
 extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
 				   const struct cpumask *online_mask);
 extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
-- 
1.8.3.1

* [PATCH 06/21] block: Remove redundant blk_mq_ops::map_queue() interface
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

All drivers that set the blk_mq_ops::map_queue() interface point it at
the default blk_mq_map_queue() function, which makes the interface
itself redundant.

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-flush.c                 |  6 +++---
 block/blk-mq-tag.c                |  5 ++---
 block/blk-mq.c                    | 32 +++++++++++---------------------
 block/blk.h                       |  2 +-
 drivers/block/loop.c              |  1 -
 drivers/block/mtip32xx/mtip32xx.c |  1 -
 drivers/block/null_blk.c          |  1 -
 drivers/block/rbd.c               |  1 -
 drivers/block/virtio_blk.c        |  1 -
 drivers/block/xen-blkfront.c      |  1 -
 drivers/md/dm-rq.c                |  1 -
 drivers/mtd/ubi/block.c           |  1 -
 drivers/nvme/host/pci.c           |  2 --
 drivers/nvme/host/rdma.c          |  2 --
 drivers/nvme/target/loop.c        |  2 --
 drivers/scsi/scsi_lib.c           |  1 -
 include/linux/blk-mq.h            | 12 ++++++------
 17 files changed, 23 insertions(+), 49 deletions(-)

diff --git a/block/blk-flush.c b/block/blk-flush.c
index d308def..6a14b68 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -232,7 +232,7 @@ static void flush_end_io(struct request *flush_rq, int error)
 
 		/* release the tag's ownership to the req cloned from */
 		spin_lock_irqsave(&fq->mq_flush_lock, flags);
-		hctx = q->mq_ops->map_queue(q, flush_rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu);
 		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
 		flush_rq->tag = -1;
 	}
@@ -325,7 +325,7 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq)
 		flush_rq->tag = first_rq->tag;
 		fq->orig_rq = first_rq;
 
-		hctx = q->mq_ops->map_queue(q, first_rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu);
 		blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq);
 	}
 
@@ -358,7 +358,7 @@ static void mq_flush_data_end_io(struct request *rq, int error)
 	unsigned long flags;
 	struct blk_flush_queue *fq = blk_get_flush_queue(q, ctx);
 
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	/*
 	 * After populating an empty queue, kick it to avoid stall.  Read
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 729bac3..1602813 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -301,8 +301,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 		io_schedule();
 
 		data->ctx = blk_mq_get_ctx(data->q);
-		data->hctx = data->q->mq_ops->map_queue(data->q,
-				data->ctx->cpu);
+		data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
 		if (data->flags & BLK_MQ_REQ_RESERVED) {
 			bt = &data->hctx->tags->breserved_tags;
 		} else {
@@ -726,7 +725,7 @@ u32 blk_mq_unique_tag(struct request *rq)
 	int hwq = 0;
 
 	if (q->mq_ops) {
-		hctx = q->mq_ops->map_queue(q, rq->mq_ctx->cpu);
+		hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 		hwq = hctx->queue_num;
 	}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 401ceea..738e109 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -244,7 +244,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		return ERR_PTR(ret);
 
 	ctx = blk_mq_get_ctx(q);
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 
 	rq = __blk_mq_alloc_request(&alloc_data, rw, 0);
@@ -253,7 +253,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
 		blk_mq_put_ctx(ctx);
 
 		ctx = blk_mq_get_ctx(q);
-		hctx = q->mq_ops->map_queue(q, ctx->cpu);
+		hctx = blk_mq_map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 		rq =  __blk_mq_alloc_request(&alloc_data, rw, 0);
 		ctx = alloc_data.ctx;
@@ -340,7 +340,7 @@ void blk_mq_free_request(struct request *rq)
 	struct blk_mq_hw_ctx *hctx;
 	struct request_queue *q = rq->q;
 
-	hctx = q->mq_ops->map_queue(q, rq->mq_ctx->cpu);
+	hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	blk_mq_free_hctx_request(hctx, rq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_free_request);
@@ -1066,7 +1066,7 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
 	struct request_queue *q = rq->q;
 	struct blk_mq_hw_ctx *hctx;
 
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	spin_lock(&ctx->lock);
 	__blk_mq_insert_request(hctx, rq, at_head);
@@ -1087,7 +1087,7 @@ static void blk_mq_insert_requests(struct request_queue *q,
 
 	trace_block_unplug(q, depth, !from_schedule);
 
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	/*
 	 * preemption doesn't flush plug list, so it's possible ctx->cpu is
@@ -1222,7 +1222,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 
 	blk_queue_enter_live(q);
 	ctx = blk_mq_get_ctx(q);
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	if (rw_is_sync(bio_op(bio), bio->bi_opf))
 		op_flags |= REQ_SYNC;
@@ -1236,7 +1236,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		trace_block_sleeprq(q, bio, op);
 
 		ctx = blk_mq_get_ctx(q);
-		hctx = q->mq_ops->map_queue(q, ctx->cpu);
+		hctx = blk_mq_map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q, 0, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, op, op_flags);
 		ctx = alloc_data.ctx;
@@ -1253,8 +1253,7 @@ static int blk_mq_direct_issue_request(struct request *rq, blk_qc_t *cookie)
 {
 	int ret;
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q,
-			rq->mq_ctx->cpu);
+	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
 	struct blk_mq_queue_data bd = {
 		.rq = rq,
 		.list = NULL,
@@ -1458,15 +1457,6 @@ run_queue:
 	return cookie;
 }
 
-/*
- * Default mapping to a software queue, since we use one per CPU.
- */
-struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, const int cpu)
-{
-	return q->queue_hw_ctx[q->mq_map[cpu]];
-}
-EXPORT_SYMBOL(blk_mq_map_queue);
-
 static void blk_mq_free_rq_map(struct blk_mq_tag_set *set,
 		struct blk_mq_tags *tags, unsigned int hctx_idx)
 {
@@ -1805,7 +1795,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q,
 		if (!cpu_online(i))
 			continue;
 
-		hctx = q->mq_ops->map_queue(q, i);
+		hctx = blk_mq_map_queue(q, i);
 
 		/*
 		 * Set local node, IFF we have more than one hw queue. If
@@ -1843,7 +1833,7 @@ static void blk_mq_map_swqueue(struct request_queue *q,
 			continue;
 
 		ctx = per_cpu_ptr(q->queue_ctx, i);
-		hctx = q->mq_ops->map_queue(q, i);
+		hctx = blk_mq_map_queue(q, i);
 
 		cpumask_set_cpu(i, hctx->cpumask);
 		ctx->index_hw = hctx->nr_ctx;
@@ -2321,7 +2311,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
 		return -EINVAL;
 
-	if (!set->ops->queue_rq || !set->ops->map_queue)
+	if (!set->ops->queue_rq)
 		return -EINVAL;
 
 	if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
diff --git a/block/blk.h b/block/blk.h
index c37492f..7b0ffbd 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -44,7 +44,7 @@ static inline struct blk_flush_queue *blk_get_flush_queue(
 	if (!q->mq_ops)
 		return q->fq;
 
-	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	hctx = blk_mq_map_queue(q, ctx->cpu);
 
 	return hctx->fq;
 }
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c9f2107..cbdb3b1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1703,7 +1703,6 @@ static int loop_init_request(void *data, struct request *rq,
 
 static struct blk_mq_ops loop_mq_ops = {
 	.queue_rq       = loop_queue_rq,
-	.map_queue      = blk_mq_map_queue,
 	.init_request	= loop_init_request,
 };
 
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 2aca98e..3cc92e9 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3895,7 +3895,6 @@ exit_handler:
 
 static struct blk_mq_ops mtip_mq_ops = {
 	.queue_rq	= mtip_queue_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= mtip_init_cmd,
 	.exit_request	= mtip_free_cmd,
 	.complete	= mtip_softirq_done_fn,
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 75a7f88..7d3b7d6 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -393,7 +393,6 @@ static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 
 static struct blk_mq_ops null_mq_ops = {
 	.queue_rq       = null_queue_rq,
-	.map_queue      = blk_mq_map_queue,
 	.init_hctx	= null_init_hctx,
 	.complete	= null_softirq_done_fn,
 };
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 6c6519f..c1f84df 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3621,7 +3621,6 @@ static int rbd_init_request(void *data, struct request *rq,
 
 static struct blk_mq_ops rbd_mq_ops = {
 	.queue_rq	= rbd_queue_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= rbd_init_request,
 };
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 93b1aaa..2dc5c96 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -542,7 +542,6 @@ static int virtblk_init_request(void *data, struct request *rq,
 
 static struct blk_mq_ops virtio_mq_ops = {
 	.queue_rq	= virtio_queue_rq,
-	.map_queue	= blk_mq_map_queue,
 	.complete	= virtblk_request_done,
 	.init_request	= virtblk_init_request,
 };
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 88ef6d4..9908597 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -909,7 +909,6 @@ out_busy:
 
 static struct blk_mq_ops blkfront_mq_ops = {
 	.queue_rq = blkif_queue_rq,
-	.map_queue = blk_mq_map_queue,
 };
 
 static void blkif_set_queue_limits(struct blkfront_info *info)
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1ca7463..d1c3645 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -908,7 +908,6 @@ static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
 
 static struct blk_mq_ops dm_mq_ops = {
 	.queue_rq = dm_mq_queue_rq,
-	.map_queue = blk_mq_map_queue,
 	.complete = dm_softirq_done,
 	.init_request = dm_mq_init_request,
 };
diff --git a/drivers/mtd/ubi/block.c b/drivers/mtd/ubi/block.c
index ebf46ad..d1e6931 100644
--- a/drivers/mtd/ubi/block.c
+++ b/drivers/mtd/ubi/block.c
@@ -351,7 +351,6 @@ static int ubiblock_init_request(void *data, struct request *req,
 static struct blk_mq_ops ubiblock_mq_ops = {
 	.queue_rq       = ubiblock_queue_rq,
 	.init_request	= ubiblock_init_request,
-	.map_queue      = blk_mq_map_queue,
 };
 
 static DEFINE_IDR(ubiblock_minor_idr);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8dcf5a9..086fd7e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1131,7 +1131,6 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 static struct blk_mq_ops nvme_mq_admin_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_admin_init_hctx,
 	.exit_hctx      = nvme_admin_exit_hctx,
 	.init_request	= nvme_admin_init_request,
@@ -1141,7 +1140,6 @@ static struct blk_mq_ops nvme_mq_admin_ops = {
 static struct blk_mq_ops nvme_mq_ops = {
 	.queue_rq	= nvme_queue_rq,
 	.complete	= nvme_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_init_hctx,
 	.init_request	= nvme_init_request,
 	.timeout	= nvme_timeout,
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index ab545fb..9bbd886 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1531,7 +1531,6 @@ static void nvme_rdma_complete_rq(struct request *rq)
 static struct blk_mq_ops nvme_rdma_mq_ops = {
 	.queue_rq	= nvme_rdma_queue_rq,
 	.complete	= nvme_rdma_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_rdma_init_request,
 	.exit_request	= nvme_rdma_exit_request,
 	.reinit_request	= nvme_rdma_reinit_request,
@@ -1543,7 +1542,6 @@ static struct blk_mq_ops nvme_rdma_mq_ops = {
 static struct blk_mq_ops nvme_rdma_admin_mq_ops = {
 	.queue_rq	= nvme_rdma_queue_rq,
 	.complete	= nvme_rdma_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_rdma_init_admin_request,
 	.exit_request	= nvme_rdma_exit_admin_request,
 	.reinit_request	= nvme_rdma_reinit_request,
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 395e60d..d5df77d 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -273,7 +273,6 @@ static int nvme_loop_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 static struct blk_mq_ops nvme_loop_mq_ops = {
 	.queue_rq	= nvme_loop_queue_rq,
 	.complete	= nvme_loop_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_loop_init_request,
 	.init_hctx	= nvme_loop_init_hctx,
 	.timeout	= nvme_loop_timeout,
@@ -282,7 +281,6 @@ static struct blk_mq_ops nvme_loop_mq_ops = {
 static struct blk_mq_ops nvme_loop_admin_mq_ops = {
 	.queue_rq	= nvme_loop_queue_rq,
 	.complete	= nvme_loop_complete_rq,
-	.map_queue	= blk_mq_map_queue,
 	.init_request	= nvme_loop_init_admin_request,
 	.init_hctx	= nvme_loop_init_admin_hctx,
 	.timeout	= nvme_loop_timeout,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c71344a..2cca9cf 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2077,7 +2077,6 @@ struct request_queue *scsi_alloc_queue(struct scsi_device *sdev)
 }
 
 static struct blk_mq_ops scsi_mq_ops = {
-	.map_queue	= blk_mq_map_queue,
 	.queue_rq	= scsi_queue_rq,
 	.complete	= scsi_softirq_done,
 	.timeout	= scsi_timeout,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index e43bbff..6c7ee56 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -111,11 +111,6 @@ struct blk_mq_ops {
 	queue_rq_fn		*queue_rq;
 
 	/*
-	 * Map to specific hardware queue
-	 */
-	map_queue_fn		*map_queue;
-
-	/*
 	 * Called on request timeout
 	 */
 	timeout_fn		*timeout;
@@ -220,7 +215,12 @@ static inline u16 blk_mq_unique_tag_to_tag(u32 unique_tag)
 	return unique_tag & BLK_MQ_UNIQUE_TAG_MASK;
 }
 
-struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *, const int ctx_index);
+static inline struct blk_mq_hw_ctx *blk_mq_map_queue(
+		struct request_queue *q, const int cpu)
+{
+	return q->queue_hw_ctx[q->mq_map[cpu]];
+}
+
 struct blk_mq_hw_ctx *blk_mq_alloc_single_hw_queue(struct blk_mq_tag_set *, unsigned int, int);
 
 int blk_mq_request_started(struct request *rq);
-- 
1.8.3.1

* [PATCH 07/21] blk-mq: Remove a redundant assignment
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

blk_mq_hw_ctx::queue_num is already initialized in the blk_mq_init_hctx()
function, which makes this assignment redundant.

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 738e109..657e748 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2013,7 +2013,6 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 
 		atomic_set(&hctxs[i]->nr_active, 0);
 		hctxs[i]->numa_node = node;
-		hctxs[i]->queue_num = i;
 
 		if (blk_mq_init_hctx(q, set, hctxs[i], i)) {
 			free_cpumask_var(hctxs[i]->cpumask);
-- 
1.8.3.1

* [PATCH 08/21] blk-mq: Cleanup hardware context data node selection
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 657e748..d40013c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1710,13 +1710,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
 {
-	int node;
+	int node = hctx->numa_node;
 	unsigned flush_start_tag = set->queue_depth;
 
-	node = hctx->numa_node;
-	if (node == NUMA_NO_NODE)
-		node = hctx->numa_node = set->numa_node;
-
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -1999,6 +1995,9 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 			break;
 
 		node = blk_mq_hw_queue_to_node(q->mq_map, i);
+		if (node == NUMA_NO_NODE)
+			node = set->numa_node;
+
 		hctxs[i] = kzalloc_node(sizeof(struct blk_mq_hw_ctx),
 					GFP_KERNEL, node);
 		if (!hctxs[i])
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 09/21] blk-mq: Cleanup a loop exit condition
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d40013c..de700c8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1679,16 +1679,13 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
-		struct blk_mq_tag_set *set, int nr_queue)
+		struct blk_mq_tag_set *set)
 {
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (i == nr_queue)
-			break;
+	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_exit_hctx(q, set, hctx, i);
-	}
 }
 
 static void blk_mq_free_hw_queues(struct request_queue *q,
@@ -2134,7 +2131,7 @@ void blk_mq_free_queue(struct request_queue *q)
 
 	blk_mq_del_queue_tag_set(q);
 
-	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
+	blk_mq_exit_hw_queues(q, set);
 	blk_mq_free_hw_queues(q, set);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 10/21] blk-mq: Get rid of unnecessary blk_mq_free_hw_queues()
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 13 +------------
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index de700c8..639b90d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1684,17 +1684,8 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	queue_for_each_hw_ctx(q, hctx, i)
-		blk_mq_exit_hctx(q, set, hctx, i);
-}
-
-static void blk_mq_free_hw_queues(struct request_queue *q,
-		struct blk_mq_tag_set *set)
-{
-	struct blk_mq_hw_ctx *hctx;
-	unsigned int i;
-
 	queue_for_each_hw_ctx(q, hctx, i) {
+		blk_mq_exit_hctx(q, set, hctx, i);
 		free_cpumask_var(hctx->cpumask);
 		kfree(hctx->ctxs);
 		kfree(hctx);
@@ -2130,9 +2121,7 @@ void blk_mq_free_queue(struct request_queue *q)
 	mutex_unlock(&all_q_mutex);
 
 	blk_mq_del_queue_tag_set(q);
-
 	blk_mq_exit_hw_queues(q, set);
-	blk_mq_free_hw_queues(q, set);
 }
 
 /* Basically redo blk_mq_init_queue with queue frozen */
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 11/21] blk-mq: Move duplicating code to blk_mq_exit_hctx()
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 639b90d..9b1b6dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1676,6 +1676,10 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
 	blk_free_flush_queue(hctx->fq);
 	blk_mq_free_bitmap(&hctx->ctx_map);
+
+	free_cpumask_var(hctx->cpumask);
+	kfree(hctx->ctxs);
+	kfree(hctx);
 }
 
 static void blk_mq_exit_hw_queues(struct request_queue *q,
@@ -1684,12 +1688,8 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	struct blk_mq_hw_ctx *hctx;
 	unsigned int i;
 
-	queue_for_each_hw_ctx(q, hctx, i) {
+	queue_for_each_hw_ctx(q, hctx, i)
 		blk_mq_exit_hctx(q, set, hctx, i);
-		free_cpumask_var(hctx->cpumask);
-		kfree(hctx->ctxs);
-		kfree(hctx);
-	}
 
 	q->nr_hw_queues = 0;
 }
@@ -2018,12 +2018,8 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 				set->tags[j] = NULL;
 			}
 			blk_mq_exit_hctx(q, set, hctx, j);
-			free_cpumask_var(hctx->cpumask);
 			kobject_put(&hctx->kobj);
-			kfree(hctx->ctxs);
-			kfree(hctx);
 			hctxs[j] = NULL;
-
 		}
 	}
 	q->nr_hw_queues = i;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 12/21] blk-mq: Uninit hardware context in order reverse to init
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9b1b6dc..f2bae1a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2013,12 +2013,13 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx = hctxs[j];
 
 		if (hctx) {
+			kobject_put(&hctx->kobj);
+
 			if (hctx->tags) {
 				blk_mq_free_rq_map(set, hctx->tags, j);
 				set->tags[j] = NULL;
 			}
 			blk_mq_exit_hctx(q, set, hctx, j);
-			kobject_put(&hctx->kobj);
 			hctxs[j] = NULL;
 		}
 	}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 13/21] blk-mq: Move hardware context init code into blk_mq_init_hctx()
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

Move scattered hardware context initialization code into a single
function dedicated to that purpose, blk_mq_init_hctx().

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 81 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 40 insertions(+), 41 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2bae1a..b77e73b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1694,17 +1694,30 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 	q->nr_hw_queues = 0;
 }
 
-static int blk_mq_init_hctx(struct request_queue *q,
-		struct blk_mq_tag_set *set,
-		struct blk_mq_hw_ctx *hctx, unsigned hctx_idx)
+static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
+		struct blk_mq_tag_set *set, unsigned hctx_idx)
 {
-	int node = hctx->numa_node;
 	unsigned flush_start_tag = set->queue_depth;
+	struct blk_mq_hw_ctx *hctx;
+	int node;
+
+	node = blk_mq_hw_queue_to_node(q->mq_map, hctx_idx);
+	if (node == NUMA_NO_NODE)
+		node = set->numa_node;
+
+	hctx = kzalloc_node(sizeof(*hctx), GFP_KERNEL, node);
+	if (!hctx)
+		return NULL;
+
+	if (!zalloc_cpumask_var_node(&hctx->cpumask, GFP_KERNEL, node))
+		goto free_hctx;
 
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
+	atomic_set(&hctx->nr_active, 0);
+	hctx->numa_node = node;
 	hctx->queue = q;
 	hctx->queue_num = hctx_idx;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
@@ -1743,7 +1756,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 				   flush_start_tag + hctx_idx, node))
 		goto free_fq;
 
-	return 0;
+	return hctx;
 
  free_fq:
 	kfree(hctx->fq);
@@ -1756,8 +1769,11 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	kfree(hctx->ctxs);
  unregister_cpu_notifier:
 	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
-
-	return -1;
+	free_cpumask_var(hctx->cpumask);
+ free_hctx:
+	kfree(hctx);
+	
+	return NULL;
 }
 
 static void blk_mq_init_cpu_queues(struct request_queue *q,
@@ -1971,57 +1987,40 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 						struct request_queue *q)
 {
 	int i, j;
+	struct blk_mq_hw_ctx *hctx;
 	struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
 
 	blk_mq_sysfs_unregister(q);
 	for (i = 0; i < set->nr_hw_queues; i++) {
-		int node;
-
 		if (hctxs[i])
 			continue;
 		if (!set->tags[i])
 			break;
 
-		node = blk_mq_hw_queue_to_node(q->mq_map, i);
-		if (node == NUMA_NO_NODE)
-			node = set->numa_node;
-
-		hctxs[i] = kzalloc_node(sizeof(struct blk_mq_hw_ctx),
-					GFP_KERNEL, node);
-		if (!hctxs[i])
-			break;
-
-		if (!zalloc_cpumask_var_node(&hctxs[i]->cpumask, GFP_KERNEL,
-						node)) {
-			kfree(hctxs[i]);
-			hctxs[i] = NULL;
+		hctx = blk_mq_init_hctx(q, set, i);
+		if (!hctx)
 			break;
-		}
 
-		atomic_set(&hctxs[i]->nr_active, 0);
-		hctxs[i]->numa_node = node;
+		blk_mq_hctx_kobj_init(hctx);
 
-		if (blk_mq_init_hctx(q, set, hctxs[i], i)) {
-			free_cpumask_var(hctxs[i]->cpumask);
-			kfree(hctxs[i]);
-			hctxs[i] = NULL;
-			break;
-		}
-		blk_mq_hctx_kobj_init(hctxs[i]);
+		hctxs[i] = hctx;
 	}
 	for (j = i; j < q->nr_hw_queues; j++) {
-		struct blk_mq_hw_ctx *hctx = hctxs[j];
+		hctx = hctxs[j];
 
-		if (hctx) {
-			kobject_put(&hctx->kobj);
+		if (!hctx)
+			continue;
 
-			if (hctx->tags) {
-				blk_mq_free_rq_map(set, hctx->tags, j);
-				set->tags[j] = NULL;
-			}
-			blk_mq_exit_hctx(q, set, hctx, j);
-			hctxs[j] = NULL;
+		kobject_put(&hctx->kobj);
+
+		if (hctx->tags) {
+			blk_mq_free_rq_map(set, hctx->tags, j);
+			set->tags[j] = NULL;
 		}
+
+		blk_mq_exit_hctx(q, set, hctx, j);
+
+		hctxs[j] = NULL;
 	}
 	q->nr_hw_queues = i;
 	blk_mq_sysfs_register(q);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 14/21] blk-mq: Rework blk_mq_init_hctx() function
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

Rework the blk_mq_init_hctx() function so that all required memory
allocations are done before data initialization and callback
invocation.
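
To make the ordering concrete, here is a minimal, self-contained
userspace C sketch (not kernel code; ctx_create(), driver_init() and
the buffer fields are invented for this illustration): every
allocation happens up front, the driver callback runs only once
nothing else can fail, and the error path unwinds in exact reverse
order of the allocations.

  #include <stdlib.h>

  /* Stand-in for a driver init_hctx-style callback. */
  struct ctx { void *bufa; void *bufb; };
  static int driver_init(struct ctx *c) { (void)c; return 0; }

  static struct ctx *ctx_create(void)
  {
          struct ctx *c = calloc(1, sizeof(*c));

          if (!c)
                  return NULL;
          c->bufa = malloc(64);
          if (!c->bufa)
                  goto free_ctx;
          c->bufb = malloc(64);
          if (!c->bufb)
                  goto free_bufa;

          /* Callbacks only after every allocation has succeeded. */
          if (driver_init(c))
                  goto free_bufb;

          return c;

  free_bufb:
          free(c->bufb);
  free_bufa:
          free(c->bufa);
  free_ctx:
          free(c);
          return NULL;
  }

  int main(void)
  {
          struct ctx *c = ctx_create();

          if (c) {
                  free(c->bufb);
                  free(c->bufa);
                  free(c);
          }
          return 0;
  }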

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 50 ++++++++++++++++++++++++--------------------------
 1 file changed, 24 insertions(+), 26 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index b77e73b..9e5cd1f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1712,6 +1712,22 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 	if (!zalloc_cpumask_var_node(&hctx->cpumask, GFP_KERNEL, node))
 		goto free_hctx;
 
+	/*
+	 * Allocate space for all possible cpus to avoid allocation at
+	 * runtime
+	 */
+	hctx->ctxs = kmalloc_node(nr_cpu_ids * sizeof(void *),
+					GFP_KERNEL, node);
+	if (!hctx->ctxs)
+		goto free_cpumask;
+
+	if (blk_mq_alloc_bitmap(&hctx->ctx_map, node))
+		goto free_ctxs;
+
+	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
+	if (!hctx->fq)
+		goto free_bitmap;
+
 	INIT_DELAYED_WORK(&hctx->run_work, blk_mq_run_work_fn);
 	INIT_DELAYED_WORK(&hctx->delay_work, blk_mq_delay_work_fn);
 	spin_lock_init(&hctx->lock);
@@ -1720,55 +1736,37 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 	hctx->numa_node = node;
 	hctx->queue = q;
 	hctx->queue_num = hctx_idx;
+	hctx->nr_ctx = 0;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
+	hctx->tags = set->tags[hctx_idx];
 
 	blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
 					blk_mq_hctx_notify, hctx);
 	blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
 
-	hctx->tags = set->tags[hctx_idx];
-
-	/*
-	 * Allocate space for all possible cpus to avoid allocation at
-	 * runtime
-	 */
-	hctx->ctxs = kmalloc_node(nr_cpu_ids * sizeof(void *),
-					GFP_KERNEL, node);
-	if (!hctx->ctxs)
-		goto unregister_cpu_notifier;
-
-	if (blk_mq_alloc_bitmap(&hctx->ctx_map, node))
-		goto free_ctxs;
-
-	hctx->nr_ctx = 0;
-
 	if (set->ops->init_hctx &&
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto free_bitmap;
-
-	hctx->fq = blk_alloc_flush_queue(q, hctx->numa_node, set->cmd_size);
-	if (!hctx->fq)
-		goto exit_hctx;
+		goto unregister_cpu_notifier;
 
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
 				   hctx->fq->flush_rq, hctx_idx,
 				   flush_start_tag + hctx_idx, node))
-		goto free_fq;
+		goto exit_hctx;
 
 	return hctx;
 
- free_fq:
-	kfree(hctx->fq);
  exit_hctx:
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
+ unregister_cpu_notifier:
+	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
+	kfree(hctx->fq);
  free_bitmap:
 	blk_mq_free_bitmap(&hctx->ctx_map);
  free_ctxs:
 	kfree(hctx->ctxs);
- unregister_cpu_notifier:
-	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
+ free_cpumask:
 	free_cpumask_var(hctx->cpumask);
  free_hctx:
 	kfree(hctx);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 15/21] blk-mq: Pair blk_mq_hctx_kobj_init() with blk_mq_hctx_kobj_put()
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq-sysfs.c | 5 +++++
 block/blk-mq.c       | 2 +-
 block/blk-mq.h       | 1 +
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index fe822aa..404ccd5 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -417,6 +417,11 @@ void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
 	kobject_init(&hctx->kobj, &blk_mq_hw_ktype);
 }
 
+void blk_mq_hctx_kobj_put(struct blk_mq_hw_ctx *hctx)
+{
+	kobject_put(&hctx->kobj);
+}
+
 static void blk_mq_sysfs_init(struct request_queue *q)
 {
 	struct blk_mq_ctx *ctx;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9e5cd1f..e14f7e8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2009,7 +2009,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 		if (!hctx)
 			continue;
 
-		kobject_put(&hctx->kobj);
+		blk_mq_hctx_kobj_put(hctx);
 
 		if (hctx->tags) {
 			blk_mq_free_rq_map(set, hctx->tags, j);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 97b0051..592e308 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -57,6 +57,7 @@ extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 extern int blk_mq_sysfs_register(struct request_queue *q);
 extern void blk_mq_sysfs_unregister(struct request_queue *q);
 extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
+extern void blk_mq_hctx_kobj_put(struct blk_mq_hw_ctx *hctx);
 
 extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 16/21] blk-mq: Set flush_start_tag to BLK_MQ_MAX_DEPTH
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

We need flush tags that are unique across hardware contexts and do
not overlap with normal tags. BLK_MQ_MAX_DEPTH as a base number seems
a better choice than a queue's depth.
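
For illustration only, a tiny userspace sketch of the resulting tag
layout (the BLK_MQ_MAX_DEPTH value and the queue depth below are
assumptions made for this example): normal tags stay within each
queue's depth, while every hardware context gets one flush tag at
BLK_MQ_MAX_DEPTH + hctx_idx, so the two ranges cannot collide
regardless of the configured depth.

  #include <stdio.h>

  #define BLK_MQ_MAX_DEPTH 10240  /* assumed value, for illustration only */

  int main(void)
  {
          unsigned int queue_depth = 128;  /* normal tags: 0..queue_depth-1 */
          unsigned int hctx_idx;

          for (hctx_idx = 0; hctx_idx < 4; hctx_idx++)
                  printf("hctx %u: normal tags [0..%u], flush tag %u\n",
                         hctx_idx, queue_depth - 1,
                         BLK_MQ_MAX_DEPTH + hctx_idx);
          return 0;
  }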

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e14f7e8..c27e64e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1661,14 +1661,12 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 {
-	unsigned flush_start_tag = set->queue_depth;
-
 	blk_mq_tag_idle(hctx);
 
 	if (set->ops->exit_request)
 		set->ops->exit_request(set->driver_data,
 				       hctx->fq->flush_rq, hctx_idx,
-				       flush_start_tag + hctx_idx);
+				       BLK_MQ_MAX_DEPTH + hctx_idx);
 
 	if (set->ops->exit_hctx)
 		set->ops->exit_hctx(hctx, hctx_idx);
@@ -1697,7 +1695,6 @@ static void blk_mq_exit_hw_queues(struct request_queue *q,
 static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set, unsigned hctx_idx)
 {
-	unsigned flush_start_tag = set->queue_depth;
 	struct blk_mq_hw_ctx *hctx;
 	int node;
 
@@ -1751,7 +1748,7 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
 				   hctx->fq->flush_rq, hctx_idx,
-				   flush_start_tag + hctx_idx, node))
+				   BLK_MQ_MAX_DEPTH + hctx_idx, node))
 		goto exit_hctx;
 
 	return hctx;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH RFC 17/21] blk-mq: Introduce a 1:N hardware contexts
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

This is the first change in a bid to enable mapping of multiple
device hardware queues to a single CPU.

It introduces the concept of a 1:1 low-level hardware context
(one low-level hardware context per device hardware queue) as
opposed to a 1:N hardware context (one hardware context serving
N device hardware queues). Basically, the latter replaces what
is now the 1:1 hardware context.
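
A minimal userspace sketch of the parent-pointer recovery this layout
permits (the structures are simplified stand-ins, not the real
blk_mq_hw_ctx/blk_mq_llhw_ctx): each low-level context records its
index in the parent's flexible array, so the parent can be recovered
by stepping back to element 0 and subtracting the array's offset,
which is what blk_mq_to_hctx() in this patch does.

  #include <stdio.h>
  #include <stdlib.h>
  #include <stddef.h>

  /* Simplified stand-ins for the structures introduced by this patch. */
  struct llhw_ctx {
          int index;      /* position within the parent's llhw_ctxs[] */
          int queue_id;   /* device hardware queue driven by this context */
  };

  struct hw_ctx {
          unsigned int nr_llhw_ctx;
          struct llhw_ctx llhw_ctxs[];    /* flexible array member */
  };

  /* Same arithmetic as blk_mq_to_hctx(): back to element 0, then up to the parent. */
  static struct hw_ctx *to_hw_ctx(struct llhw_ctx *llhw_ctx)
  {
          struct llhw_ctx *first = llhw_ctx - llhw_ctx->index;

          return (struct hw_ctx *)((char *)first - offsetof(struct hw_ctx, llhw_ctxs));
  }

  int main(void)
  {
          unsigned int i, n = 2;
          struct hw_ctx *hctx;

          hctx = malloc(sizeof(*hctx) + n * sizeof(hctx->llhw_ctxs[0]));
          if (!hctx)
                  return 1;

          hctx->nr_llhw_ctx = n;
          for (i = 0; i < n; i++)
                  hctx->llhw_ctxs[i].index = i;

          printf("parent recovered: %d\n",
                 to_hw_ctx(&hctx->llhw_ctxs[1]) == hctx);
          free(hctx);
          return 0;
  }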

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-core.c                  |  3 ++-
 block/blk-mq.c                    | 32 +++++++++++++++++++++++---------
 drivers/block/loop.c              |  2 +-
 drivers/block/mtip32xx/mtip32xx.c |  3 ++-
 drivers/block/null_blk.c          | 11 +++++------
 drivers/block/rbd.c               |  2 +-
 drivers/block/virtio_blk.c        |  5 +++--
 drivers/block/xen-blkfront.c      |  5 +++--
 drivers/md/dm-rq.c                |  3 ++-
 drivers/nvme/host/pci.c           | 27 +++++++++++++++------------
 drivers/scsi/scsi_lib.c           |  3 ++-
 include/linux/blk-mq.h            | 27 +++++++++++++++++++++------
 12 files changed, 80 insertions(+), 43 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 36c7ac3..bf4f196 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3314,11 +3314,12 @@ bool blk_poll(struct request_queue *q, blk_qc_t cookie)
 	while (!need_resched()) {
 		unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
 		struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[queue_num];
+		struct blk_mq_llhw_ctx *llhw_ctx = &hctx->llhw_ctxs[0];
 		int ret;
 
 		hctx->poll_invoked++;
 
-		ret = q->mq_ops->poll(hctx, blk_qc_t_to_tag(cookie));
+		ret = q->mq_ops->poll(llhw_ctx, blk_qc_t_to_tag(cookie));
 		if (ret > 0) {
 			hctx->poll_success++;
 			set_current_state(TASK_RUNNING);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c27e64e..274eab8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -838,7 +838,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		bd.list = dptr;
 		bd.last = list_empty(&rq_list);
 
-		ret = q->mq_ops->queue_rq(hctx, &bd);
+		ret = q->mq_ops->queue_rq(&hctx->llhw_ctxs[0], &bd);
 		switch (ret) {
 		case BLK_MQ_RQ_QUEUE_OK:
 			queued++;
@@ -1266,7 +1266,7 @@ static int blk_mq_direct_issue_request(struct request *rq, blk_qc_t *cookie)
 	 * error (busy), just add it to our list as we previously
 	 * would have done
 	 */
-	ret = q->mq_ops->queue_rq(hctx, &bd);
+	ret = q->mq_ops->queue_rq(&hctx->llhw_ctxs[0], &bd);
 	if (ret == BLK_MQ_RQ_QUEUE_OK) {
 		*cookie = new_cookie;
 		return 0;
@@ -1661,6 +1661,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set,
 		struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 {
+	int i;
+
 	blk_mq_tag_idle(hctx);
 
 	if (set->ops->exit_request)
@@ -1669,7 +1671,8 @@ static void blk_mq_exit_hctx(struct request_queue *q,
 				       BLK_MQ_MAX_DEPTH + hctx_idx);
 
 	if (set->ops->exit_hctx)
-		set->ops->exit_hctx(hctx, hctx_idx);
+		for (i = 0; i < hctx->nr_llhw_ctx; i++)
+			set->ops->exit_hctx(&hctx->llhw_ctxs[i]);
 
 	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
 	blk_free_flush_queue(hctx->fq);
@@ -1696,13 +1699,16 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set, unsigned hctx_idx)
 {
 	struct blk_mq_hw_ctx *hctx;
+	unsigned int nr_llhw_ctx = 1;
 	int node;
+	int i;
 
 	node = blk_mq_hw_queue_to_node(q->mq_map, hctx_idx);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 
-	hctx = kzalloc_node(sizeof(*hctx), GFP_KERNEL, node);
+	hctx = kzalloc_node(sizeof(*hctx) +
+		nr_llhw_ctx * sizeof(hctx->llhw_ctxs[0]), GFP_KERNEL, node);
 	if (!hctx)
 		return NULL;
 
@@ -1734,6 +1740,7 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue = q;
 	hctx->queue_num = hctx_idx;
 	hctx->nr_ctx = 0;
+	hctx->nr_llhw_ctx = nr_llhw_ctx;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 	hctx->tags = set->tags[hctx_idx];
 
@@ -1741,9 +1748,16 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 					blk_mq_hctx_notify, hctx);
 	blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
 
-	if (set->ops->init_hctx &&
-	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
-		goto unregister_cpu_notifier;
+	for (i = 0; i < hctx->nr_llhw_ctx; i++) {
+		struct blk_mq_llhw_ctx *llhw_ctx = &hctx->llhw_ctxs[i];
+
+		llhw_ctx->index = i;
+		llhw_ctx->queue_id = hctx_idx;
+
+		if (set->ops->init_hctx &&
+		    set->ops->init_hctx(llhw_ctx, set->driver_data))
+			goto exit_hctx;
+	}
 
 	if (set->ops->init_request &&
 	    set->ops->init_request(set->driver_data,
@@ -1755,8 +1769,8 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 
  exit_hctx:
 	if (set->ops->exit_hctx)
-		set->ops->exit_hctx(hctx, hctx_idx);
- unregister_cpu_notifier:
+		for (i--; i >= 0; i--)
+			set->ops->exit_hctx(&hctx->llhw_ctxs[i]);
 	blk_mq_unregister_cpu_notifier(&hctx->cpu_notifier);
 	kfree(hctx->fq);
  free_bitmap:
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cbdb3b1..f290c64 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1637,7 +1637,7 @@ int loop_unregister_transfer(int number)
 EXPORT_SYMBOL(loop_register_transfer);
 EXPORT_SYMBOL(loop_unregister_transfer);
 
-static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int loop_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 3cc92e9..5d7c17d 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -3805,9 +3805,10 @@ static bool mtip_check_unal_depth(struct blk_mq_hw_ctx *hctx,
 	return false;
 }
 
-static int mtip_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int mtip_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 			 const struct blk_mq_queue_data *bd)
 {
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
 	struct request *rq = bd->rq;
 	int ret;
 
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 7d3b7d6..1747040 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -351,7 +351,7 @@ static void null_request_fn(struct request_queue *q)
 	}
 }
 
-static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int null_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 			 const struct blk_mq_queue_data *bd)
 {
 	struct nullb_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
@@ -361,7 +361,7 @@ static int null_queue_rq(struct blk_mq_hw_ctx *hctx,
 		cmd->timer.function = null_cmd_timer_expired;
 	}
 	cmd->rq = bd->rq;
-	cmd->nq = hctx->driver_data;
+	cmd->nq = llhw_ctx->driver_data;
 
 	blk_mq_start_request(bd->rq);
 
@@ -378,13 +378,12 @@ static void null_init_queue(struct nullb *nullb, struct nullb_queue *nq)
 	nq->queue_depth = nullb->queue_depth;
 }
 
-static int null_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-			  unsigned int index)
+static int null_init_hctx(struct blk_mq_llhw_ctx *llhw_ctx, void *data)
 {
 	struct nullb *nullb = data;
-	struct nullb_queue *nq = &nullb->queues[index];
+	struct nullb_queue *nq = &nullb->queues[llhw_ctx->queue_id];
 
-	hctx->driver_data = nq;
+	llhw_ctx->driver_data = nq;
 	null_init_queue(nullb, nq);
 	nullb->nr_queues++;
 
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index c1f84df..7dd5e0e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3383,7 +3383,7 @@ err:
 	blk_mq_end_request(rq, result);
 }
 
-static int rbd_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int rbd_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct request *rq = bd->rq;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 2dc5c96..9cc26c7 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -157,15 +157,16 @@ static void virtblk_done(struct virtqueue *vq)
 	spin_unlock_irqrestore(&vblk->vqs[qid].lock, flags);
 }
 
-static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int virtio_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 			   const struct blk_mq_queue_data *bd)
 {
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
 	struct virtio_blk *vblk = hctx->queue->queuedata;
 	struct request *req = bd->rq;
 	struct virtblk_req *vbr = blk_mq_rq_to_pdu(req);
 	unsigned long flags;
 	unsigned int num;
-	int qid = hctx->queue_num;
+	int qid = llhw_ctx->queue_id;
 	int err;
 	bool notify = false;
 
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 9908597..784c4d5 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -872,11 +872,12 @@ static inline bool blkif_request_flush_invalid(struct request *req,
 		 !info->feature_fua));
 }
 
-static int blkif_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int blkif_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 			  const struct blk_mq_queue_data *qd)
 {
 	unsigned long flags;
-	int qid = hctx->queue_num;
+	int qid = llhw_ctx->queue_id;
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
 	struct blkfront_info *info = hctx->queue->queuedata;
 	struct blkfront_ring_info *rinfo = NULL;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d1c3645..b074137 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -855,9 +855,10 @@ static int dm_mq_init_request(void *data, struct request *rq,
 	return 0;
 }
 
-static int dm_mq_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int dm_mq_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 			  const struct blk_mq_queue_data *bd)
 {
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
 	struct request *rq = bd->rq;
 	struct dm_rq_target_io *tio = blk_mq_rq_to_pdu(rq);
 	struct mapped_device *md = tio->md;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 086fd7e..eef2e40 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -201,9 +201,10 @@ static unsigned int nvme_cmd_size(struct nvme_dev *dev)
 		nvme_iod_alloc_size(dev, NVME_INT_BYTES(dev), NVME_INT_PAGES);
 }
 
-static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-				unsigned int hctx_idx)
+static int nvme_admin_init_hctx(struct blk_mq_llhw_ctx *llhw_ctx, void *data)
 {
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
+	unsigned int hctx_idx = llhw_ctx->queue_id;
 	struct nvme_dev *dev = data;
 	struct nvme_queue *nvmeq = dev->queues[0];
 
@@ -211,14 +212,14 @@ static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 	WARN_ON(dev->admin_tagset.tags[0] != hctx->tags);
 	WARN_ON(nvmeq->tags);
 
-	hctx->driver_data = nvmeq;
+	llhw_ctx->driver_data = nvmeq;
 	nvmeq->tags = &dev->admin_tagset.tags[0];
 	return 0;
 }
 
-static void nvme_admin_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+static void nvme_admin_exit_hctx(struct blk_mq_llhw_ctx *llhw_ctx)
 {
-	struct nvme_queue *nvmeq = hctx->driver_data;
+	struct nvme_queue *nvmeq = llhw_ctx->driver_data;
 
 	nvmeq->tags = NULL;
 }
@@ -236,9 +237,10 @@ static int nvme_admin_init_request(void *data, struct request *req,
 	return 0;
 }
 
-static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
-			  unsigned int hctx_idx)
+static int nvme_init_hctx(struct blk_mq_llhw_ctx *llhw_ctx, void *data)
 {
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
+	unsigned int hctx_idx = llhw_ctx->queue_id;
 	struct nvme_dev *dev = data;
 	struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];
 
@@ -246,7 +248,7 @@ static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 		nvmeq->tags = &dev->tagset.tags[hctx_idx];
 
 	WARN_ON(dev->tagset.tags[hctx_idx] != hctx->tags);
-	hctx->driver_data = nvmeq;
+	llhw_ctx->driver_data = nvmeq;
 	return 0;
 }
 
@@ -558,11 +560,12 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
 /*
  * NOTE: ns is NULL when called on the admin queue.
  */
-static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int nvme_queue_rq(struct blk_mq_llhw_ctx *llhw_ctx,
 			 const struct blk_mq_queue_data *bd)
 {
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
 	struct nvme_ns *ns = hctx->queue->queuedata;
-	struct nvme_queue *nvmeq = hctx->driver_data;
+	struct nvme_queue *nvmeq = llhw_ctx->driver_data;
 	struct nvme_dev *dev = nvmeq->dev;
 	struct request *req = bd->rq;
 	struct nvme_command cmnd;
@@ -742,9 +745,9 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_NONE;
 }
 
-static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
+static int nvme_poll(struct blk_mq_llhw_ctx *llhw_ctx, unsigned int tag)
 {
-	struct nvme_queue *nvmeq = hctx->driver_data;
+	struct nvme_queue *nvmeq = llhw_ctx->driver_data;
 
 	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
 		spin_lock_irq(&nvmeq->q_lock);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 2cca9cf..0019213 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1876,9 +1876,10 @@ static void scsi_mq_done(struct scsi_cmnd *cmd)
 	blk_mq_complete_request(cmd->request, cmd->request->errors);
 }
 
-static int scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
+static int scsi_queue_rq(struct blk_mq_llhw_ctx	*llhw_ctx,
 			 const struct blk_mq_queue_data *bd)
 {
+	struct blk_mq_hw_ctx *hctx = blk_mq_to_hctx(llhw_ctx);
 	struct request *req = bd->rq;
 	struct request_queue *q = req->q;
 	struct scsi_device *sdev = q->queuedata;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 6c7ee56..2c3392b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -18,6 +18,12 @@ struct blk_mq_ctxmap {
 	struct blk_align_bitmap *map;
 };
 
+struct blk_mq_llhw_ctx {
+	int			index;
+	int			queue_id;
+	void			*driver_data;
+};
+
 struct blk_mq_hw_ctx {
 	struct {
 		spinlock_t		lock;
@@ -36,8 +42,6 @@ struct blk_mq_hw_ctx {
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;
 
-	void			*driver_data;
-
 	struct blk_mq_ctxmap	ctx_map;
 
 	unsigned int		nr_ctx;
@@ -62,8 +66,19 @@ struct blk_mq_hw_ctx {
 
 	unsigned long		poll_invoked;
 	unsigned long		poll_success;
+
+	unsigned int		nr_llhw_ctx;
+	struct blk_mq_llhw_ctx	llhw_ctxs[0];
 };
 
+static inline
+struct blk_mq_hw_ctx *blk_mq_to_hctx(struct blk_mq_llhw_ctx *llhw_ctx)
+{
+	struct blk_mq_llhw_ctx *llhw_ctx_0 = llhw_ctx - llhw_ctx->index;
+
+	return (void *)llhw_ctx_0 - offsetof(struct blk_mq_hw_ctx, llhw_ctxs);
+}
+
 struct blk_mq_tag_set {
 	struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
@@ -87,11 +102,11 @@ struct blk_mq_queue_data {
 	bool last;
 };
 
-typedef int (queue_rq_fn)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *);
+typedef int (queue_rq_fn)(struct blk_mq_llhw_ctx *, const struct blk_mq_queue_data *);
 typedef struct blk_mq_hw_ctx *(map_queue_fn)(struct request_queue *, const int);
 typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool);
-typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
-typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int);
+typedef int (init_hctx_fn)(struct blk_mq_llhw_ctx *, void *);
+typedef void (exit_hctx_fn)(struct blk_mq_llhw_ctx *);
 typedef int (init_request_fn)(void *, struct request *, unsigned int,
 		unsigned int, unsigned int);
 typedef void (exit_request_fn)(void *, struct request *, unsigned int,
@@ -101,7 +116,7 @@ typedef int (reinit_request_fn)(void *, struct request *);
 typedef void (busy_iter_fn)(struct blk_mq_hw_ctx *, struct request *, void *,
 		bool);
 typedef void (busy_tag_iter_fn)(struct request *, void *, bool);
-typedef int (poll_fn)(struct blk_mq_hw_ctx *, unsigned int);
+typedef int (poll_fn)(struct blk_mq_llhw_ctx *, unsigned int);
 
 
 struct blk_mq_ops {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH RFC 18/21] blk-mq: Enable tag numbers exceed hardware queue depth
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

This is the 2nd step in a bid to enable mapping of multiple
device hardware queues to a single CPU.

It allows the number of tags assigned to a hardware context to exceed
the device hardware queue depth. As a result, a single hardware context
can be mapped to multiple low-level hardware contexts. This is a
prerequisite for introducing combined hardware contexts.
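
For illustration only (not part of the patch), a standalone userspace
sketch of the intended tag-to-context mapping, with a hypothetical
per-queue depth of 32:

  #include <stdio.h>

  /* Same arithmetic as the new blk_mq_tag_to_llhw_ctx_idx() helper. */
  static int tag_to_llhw_ctx_idx(unsigned int tag, unsigned int llhw_queue_depth)
  {
          return tag / llhw_queue_depth;
  }

  int main(void)
  {
          unsigned int depth = 32;        /* hypothetical hw queue depth */

          /* tags 0..31 -> llhw_ctx 0, 32..63 -> llhw_ctx 1, 64..95 -> llhw_ctx 2 */
          printf("%d %d %d\n",
                 tag_to_llhw_ctx_idx(0, depth),
                 tag_to_llhw_ctx_idx(33, depth),
                 tag_to_llhw_ctx_idx(70, depth));       /* prints: 0 1 2 */
          return 0;
  }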

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-core.c       | 4 +++-
 block/blk-mq.c         | 9 +++++++--
 include/linux/blk-mq.h | 7 +++++++
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bf4f196..36ae127 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3312,9 +3312,11 @@ bool blk_poll(struct request_queue *q, blk_qc_t cookie)
 
 	state = current->state;
 	while (!need_resched()) {
+		unsigned int tag = blk_qc_t_to_tag(cookie);
 		unsigned int queue_num = blk_qc_t_to_queue_num(cookie);
 		struct blk_mq_hw_ctx *hctx = q->queue_hw_ctx[queue_num];
-		struct blk_mq_llhw_ctx *llhw_ctx = &hctx->llhw_ctxs[0];
+		int idx = blk_mq_tag_to_llhw_ctx_idx(hctx, tag);
+		struct blk_mq_llhw_ctx *llhw_ctx = &hctx->llhw_ctxs[idx];
 		int ret;
 
 		hctx->poll_invoked++;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 274eab8..6d055ec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -829,6 +829,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	queued = 0;
 	while (!list_empty(&rq_list)) {
 		struct blk_mq_queue_data bd;
+		int llhw_ctx_idx;
 		int ret;
 
 		rq = list_first_entry(&rq_list, struct request, queuelist);
@@ -838,7 +839,9 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		bd.list = dptr;
 		bd.last = list_empty(&rq_list);
 
-		ret = q->mq_ops->queue_rq(&hctx->llhw_ctxs[0], &bd);
+		llhw_ctx_idx = blk_mq_tag_to_llhw_ctx_idx(hctx, rq->tag);
+
+		ret = q->mq_ops->queue_rq(&hctx->llhw_ctxs[llhw_ctx_idx], &bd);
 		switch (ret) {
 		case BLK_MQ_RQ_QUEUE_OK:
 			queued++;
@@ -1260,13 +1263,14 @@ static int blk_mq_direct_issue_request(struct request *rq, blk_qc_t *cookie)
 		.last = 1
 	};
 	blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);
+	int llhw_ctx_idx = blk_mq_tag_to_llhw_ctx_idx(hctx, rq->tag);
 
 	/*
 	 * For OK queue, we are done. For error, kill it. Any other
 	 * error (busy), just add it to our list as we previously
 	 * would have done
 	 */
-	ret = q->mq_ops->queue_rq(&hctx->llhw_ctxs[0], &bd);
+	ret = q->mq_ops->queue_rq(&hctx->llhw_ctxs[llhw_ctx_idx], &bd);
 	if (ret == BLK_MQ_RQ_QUEUE_OK) {
 		*cookie = new_cookie;
 		return 0;
@@ -1741,6 +1745,7 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue_num = hctx_idx;
 	hctx->nr_ctx = 0;
 	hctx->nr_llhw_ctx = nr_llhw_ctx;
+	hctx->llhw_queue_depth = set->queue_depth;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 	hctx->tags = set->tags[hctx_idx];
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c3392b..52a9e7c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -67,6 +67,7 @@ struct blk_mq_hw_ctx {
 	unsigned long		poll_invoked;
 	unsigned long		poll_success;
 
+	unsigned int		llhw_queue_depth;
 	unsigned int		nr_llhw_ctx;
 	struct blk_mq_llhw_ctx	llhw_ctxs[0];
 };
@@ -79,6 +80,12 @@ struct blk_mq_hw_ctx *blk_mq_to_hctx(struct blk_mq_llhw_ctx *llhw_ctx)
 	return (void *)llhw_ctx_0 - offsetof(struct blk_mq_hw_ctx, llhw_ctxs);
 }
 
+static inline
+int blk_mq_tag_to_llhw_ctx_idx(struct blk_mq_hw_ctx *hctx, unsigned int tag)
+{
+	return tag / hctx->llhw_queue_depth;
+}
+
 struct blk_mq_tag_set {
 	struct blk_mq_ops	*ops;
 	unsigned int		nr_hw_queues;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH RFC 19/21] blk-mq: Enable combined hardware queues
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

This is the 3rd step in a bid to enable mapping of multiple
device hardware queues to a single CPU.

It introduces the combined hardware context - one consisting of
multiple low-level hardware contexts. As a result, queue depths deeper
than the device hardware queue depth become possible (but are not
yet allowed).
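
For illustration only (not part of the patch), a standalone userspace
sketch of how a combined context would back several low-level queues,
using hypothetical numbers (co_queue_size = 4, per-queue depth = 32):

  #include <stdio.h>

  int main(void)
  {
          unsigned int co_queue_size = 4; /* hw queues per combined context */
          unsigned int queue_depth = 32;  /* depth of one hw queue */
          unsigned int hctx_idx = 1;      /* second combined context */
          unsigned int i;

          /* total tags served by one combined context: 128 */
          printf("depth = %u\n", queue_depth * co_queue_size);

          /* device queue ids backing combined context 1: 4 5 6 7 */
          for (i = 0; i < co_queue_size; i++)
                  printf("queue_id = %u\n", hctx_idx * co_queue_size + i);

          return 0;
  }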

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq-tag.c     |   4 +-
 block/blk-mq.c         | 150 +++++++++++++++----------------------------------
 include/linux/blk-mq.h |   5 ++
 3 files changed, 51 insertions(+), 108 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1602813..e987a6b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -477,7 +477,7 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 {
 	int i;
 
-	for (i = 0; i < tagset->nr_hw_queues; i++) {
+	for (i = 0; i < tagset->nr_co_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
 	}
@@ -491,7 +491,7 @@ int blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
 	if (!set->ops->reinit_request)
 		goto out;
 
-	for (i = 0; i < set->nr_hw_queues; i++) {
+	for (i = 0; i < set->nr_co_queues; i++) {
 		struct blk_mq_tags *tags = set->tags[i];
 
 		for (j = 0; j < tags->nr_tags; j++) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6d055ec..450a3ed 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1499,22 +1499,27 @@ static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
+static unsigned int queue_depth(struct blk_mq_tag_set *set)
+{
+	return set->queue_depth * set->co_queue_size;
+}
+
 static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		unsigned int hctx_idx)
 {
 	struct blk_mq_tags *tags;
 	unsigned int i, j, entries_per_page, max_order = 4;
 	size_t rq_size, left;
+	unsigned int depth = queue_depth(set);
 
-	tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
-				set->numa_node,
+	tags = blk_mq_init_tags(depth, set->reserved_tags, set->numa_node,
 				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
 	if (!tags)
 		return NULL;
 
 	INIT_LIST_HEAD(&tags->page_list);
 
-	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
+	tags->rqs = kzalloc_node(depth * sizeof(struct request *),
 				 GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
 				 set->numa_node);
 	if (!tags->rqs) {
@@ -1528,9 +1533,9 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	 */
 	rq_size = round_up(sizeof(struct request) + set->cmd_size,
 				cache_line_size());
-	left = rq_size * set->queue_depth;
+	left = rq_size * depth;
 
-	for (i = 0; i < set->queue_depth; ) {
+	for (i = 0; i < depth; ) {
 		int this_order = max_order;
 		struct page *page;
 		int to_do;
@@ -1564,7 +1569,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 		 */
 		kmemleak_alloc(p, order_to_size(this_order), 1, GFP_KERNEL);
 		entries_per_page = order_to_size(this_order) / rq_size;
-		to_do = min(entries_per_page, set->queue_depth - i);
+		to_do = min(entries_per_page, depth - i);
 		left -= to_do * rq_size;
 		for (j = 0; j < to_do; j++) {
 			tags->rqs[i] = p;
@@ -1703,7 +1708,7 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_tag_set *set, unsigned hctx_idx)
 {
 	struct blk_mq_hw_ctx *hctx;
-	unsigned int nr_llhw_ctx = 1;
+	unsigned int nr_llhw_ctx = set->co_queue_size;
 	int node;
 	int i;
 
@@ -1757,7 +1762,7 @@ static struct blk_mq_hw_ctx *blk_mq_init_hctx(struct request_queue *q,
 		struct blk_mq_llhw_ctx *llhw_ctx = &hctx->llhw_ctxs[i];
 
 		llhw_ctx->index = i;
-		llhw_ctx->queue_id = hctx_idx;
+		llhw_ctx->queue_id = (hctx_idx * set->co_queue_size) + i;
 
 		if (set->ops->init_hctx &&
 		    set->ops->init_hctx(llhw_ctx, set->driver_data))
@@ -2005,7 +2010,7 @@ static void blk_mq_realloc_hw_ctxs(struct blk_mq_tag_set *set,
 	struct blk_mq_hw_ctx **hctxs = q->queue_hw_ctx;
 
 	blk_mq_sysfs_unregister(q);
-	for (i = 0; i < set->nr_hw_queues; i++) {
+	for (i = 0; i < set->nr_co_queues; i++) {
 		if (hctxs[i])
 			continue;
 		if (!set->tags[i])
@@ -2050,7 +2055,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	if (!q->queue_ctx)
 		goto err_exit;
 
-	q->queue_hw_ctx = kzalloc_node(set->nr_hw_queues *
+	q->queue_hw_ctx = kzalloc_node(set->nr_co_queues *
 			sizeof(*(q->queue_hw_ctx)), GFP_KERNEL, set->numa_node);
 	if (!q->queue_hw_ctx)
 		goto err_percpu;
@@ -2090,12 +2095,12 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 	/*
 	 * Do this after blk_queue_make_request() overrides it...
 	 */
-	q->nr_requests = set->queue_depth;
+	q->nr_requests = queue_depth(set);
 
 	if (set->ops->complete)
 		blk_queue_softirq_done(q, set->ops->complete);
 
-	blk_mq_init_cpu_queues(q, set->nr_hw_queues);
+	blk_mq_init_cpu_queues(q, set->nr_co_queues);
 
 	get_online_cpus();
 	mutex_lock(&all_q_mutex);
@@ -2232,7 +2237,7 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 {
 	int i;
 
-	for (i = 0; i < set->nr_hw_queues; i++) {
+	for (i = 0; i < set->nr_co_queues; i++) {
 		set->tags[i] = blk_mq_init_rq_map(set, i);
 		if (!set->tags[i])
 			goto out_unwind;
@@ -2248,38 +2253,11 @@ out_unwind:
 }
 
 /*
- * Allocate the request maps associated with this tag_set. Note that this
- * may reduce the depth asked for, if memory is tight. set->queue_depth
- * will be updated to reflect the allocated depth.
+ * TODO	Restore original functionality
  */
 static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 {
-	unsigned int depth;
-	int err;
-
-	depth = set->queue_depth;
-	do {
-		err = __blk_mq_alloc_rq_maps(set);
-		if (!err)
-			break;
-
-		set->queue_depth >>= 1;
-		if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
-			err = -ENOMEM;
-			break;
-		}
-	} while (set->queue_depth);
-
-	if (!set->queue_depth || err) {
-		pr_err("blk-mq: failed to allocate request map\n");
-		return -ENOMEM;
-	}
-
-	if (depth != set->queue_depth)
-		pr_info("blk-mq: reduced tag depth (%u -> %u)\n",
-						depth, set->queue_depth);
-
-	return 0;
+	return __blk_mq_alloc_rq_maps(set);
 }
 
 struct cpumask *blk_mq_tags_cpumask(struct blk_mq_tags *tags)
@@ -2291,8 +2269,7 @@ EXPORT_SYMBOL_GPL(blk_mq_tags_cpumask);
 /*
  * Alloc a tag set to be associated with one or more request queues.
  * May fail with EINVAL for various error conditions. May adjust the
- * requested depth down, if if it too large. In that case, the set
- * value will be stored in set->queue_depth.
+ * requested depth down, if if it too large.
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
@@ -2302,34 +2279,32 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		return -EINVAL;
 	if (!set->queue_depth)
 		return -EINVAL;
-	if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN)
-		return -EINVAL;
-
 	if (!set->ops->queue_rq)
 		return -EINVAL;
 
-	if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
-		pr_info("blk-mq: reduced tag depth to %u\n",
-			BLK_MQ_MAX_DEPTH);
-		set->queue_depth = BLK_MQ_MAX_DEPTH;
-	}
+	/*
+	 * TODO	Restore original queue depth and count limits
+	 */
 
 	/*
 	 * If a crashdump is active, then we are potentially in a very
-	 * memory constrained environment. Limit us to 1 queue and
-	 * 64 tags to prevent using too much memory.
+	 * memory constrained environment. Limit us to 1 queue.
 	 */
-	if (is_kdump_kernel()) {
-		set->nr_hw_queues = 1;
-		set->queue_depth = min(64U, set->queue_depth);
-	}
+	set->nr_co_queues = is_kdump_kernel() ? 1 : set->nr_hw_queues;
+	set->co_queue_size = 1;
+
+	if (queue_depth(set) < set->reserved_tags + BLK_MQ_TAG_MIN)
+		return -EINVAL;
+	if (queue_depth(set) > BLK_MQ_MAX_DEPTH)
+		return -EINVAL;
+
 	/*
 	 * There is no use for more h/w queues than cpus.
 	 */
-	if (set->nr_hw_queues > nr_cpu_ids)
-		set->nr_hw_queues = nr_cpu_ids;
+	if (set->nr_co_queues > nr_cpu_ids)
+		set->nr_co_queues = nr_cpu_ids;
 
-	set->tags = kzalloc_node(set->nr_hw_queues * sizeof(*set->tags),
+	set->tags = kzalloc_node(set->nr_co_queues * sizeof(*set->tags),
 				 GFP_KERNEL, set->numa_node);
 	if (!set->tags)
 		return -ENOMEM;
@@ -2352,7 +2327,7 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 {
 	int i;
 
-	for (i = 0; i < set->nr_hw_queues; i++) {
+	for (i = 0; i < set->nr_co_queues; i++) {
 		if (set->tags[i])
 			blk_mq_free_rq_map(set, set->tags[i], i);
 	}
@@ -2362,56 +2337,19 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
+/*
+ * TODO	Restore original functionality
+ */
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
-	struct blk_mq_tag_set *set = q->tag_set;
-	struct blk_mq_hw_ctx *hctx;
-	int i, ret;
-
-	if (!set || nr > set->queue_depth)
-		return -EINVAL;
-
-	ret = 0;
-	queue_for_each_hw_ctx(q, hctx, i) {
-		if (!hctx->tags)
-			continue;
-		ret = blk_mq_tag_update_depth(hctx->tags, nr);
-		if (ret)
-			break;
-	}
-
-	if (!ret)
-		q->nr_requests = nr;
-
-	return ret;
+	return -EINVAL;
 }
 
+/*
+ * TODO	Restore original functionality
+ */
 void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
 {
-	struct request_queue *q;
-
-	if (nr_hw_queues > nr_cpu_ids)
-		nr_hw_queues = nr_cpu_ids;
-	if (nr_hw_queues < 1 || nr_hw_queues == set->nr_hw_queues)
-		return;
-
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_freeze_queue(q);
-
-	set->nr_hw_queues = nr_hw_queues;
-	list_for_each_entry(q, &set->tag_list, tag_set_list) {
-		blk_mq_realloc_hw_ctxs(set, q);
-
-		if (q->nr_hw_queues > 1)
-			blk_queue_make_request(q, blk_mq_make_request);
-		else
-			blk_queue_make_request(q, blk_sq_make_request);
-
-		blk_mq_queue_reinit(q, cpu_online_mask);
-	}
-
-	list_for_each_entry(q, &set->tag_list, tag_set_list)
-		blk_mq_unfreeze_queue(q);
 }
 EXPORT_SYMBOL_GPL(blk_mq_update_nr_hw_queues);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 52a9e7c..579dfaf 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -88,8 +88,13 @@ int blk_mq_tag_to_llhw_ctx_idx(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 
 struct blk_mq_tag_set {
 	struct blk_mq_ops	*ops;
+
 	unsigned int		nr_hw_queues;
 	unsigned int		queue_depth;	/* max hw supported */
+
+	unsigned int		nr_co_queues;	/* number of combined queues */
+	unsigned int		co_queue_size;	/* hw queues in one combined */
+
 	unsigned int		reserved_tags;
 	unsigned int		cmd_size;	/* per-request extra data */
 	int			numa_node;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH RFC 20/21] blk-mq: Allow combined hardware queues
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

This is the 4th and last step in a bid to enable mapping
of multiple device hardware queues to a single CPU.

Available hardware queues are evenly distributed across CPUs.
Still, some queues might be left unused, but no more than
(number of queues) % (number of CPUs) in the worst case.
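
For illustration only (not part of the patch), a standalone userspace
sketch of the distribution rule applied by blk_mq_adjust_tag_set(),
with hypothetical counts (32 hw queues, 8 online CPUs, 4 unique cores):

  #include <stdio.h>

  int main(void)
  {
          unsigned int nr_hw_queues = 32, nr_cpus = 8, nr_uniq_cpus = 4;
          unsigned int nr_co_queues, co_queue_size;

          if (nr_hw_queues < nr_uniq_cpus) {
                  nr_co_queues = nr_hw_queues;    /* too few queues: keep 1:1 */
                  co_queue_size = 1;
          } else if (nr_hw_queues < nr_cpus) {
                  nr_co_queues = nr_uniq_cpus;    /* one combined queue per core */
                  co_queue_size = nr_hw_queues / nr_uniq_cpus;
          } else {
                  nr_co_queues = nr_cpus;         /* one combined queue per CPU */
                  co_queue_size = nr_hw_queues / nr_cpus;
          }

          /* 32 queues, 8 CPUs -> "8 x 4, 0 spare"; with 30 queues it would
           * be "8 x 3, 6 spare", i.e. no more than 30 % 8 queues unused. */
          printf("%u x %u, %u spare\n", nr_co_queues, co_queue_size,
                 nr_hw_queues - nr_co_queues * co_queue_size);
          return 0;
  }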

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 block/blk-mq-cpumap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq.c        | 14 +-------------
 block/blk-mq.h        |  2 ++
 3 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-cpumap.c b/block/blk-mq-cpumap.c
index ee553a4..0b49f30 100644
--- a/block/blk-mq-cpumap.c
+++ b/block/blk-mq-cpumap.c
@@ -9,6 +9,7 @@
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <linux/crash_dump.h>
 
 #include <linux/blk-mq.h>
 #include "blk.h"
@@ -86,6 +87,49 @@ int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
 	return 0;
 }
 
+void blk_mq_adjust_tag_set(struct blk_mq_tag_set *set,
+			   const struct cpumask *online_mask)
+{
+	unsigned int nr_cpus, nr_uniq_cpus, first_sibling;
+	cpumask_var_t cpus;
+	int i;
+
+	/*
+	 * If a crashdump is active, then we are potentially in a very
+	 * memory constrained environment. Limit us to 1 queue.
+	 */
+	if (is_kdump_kernel())
+		goto default_map;
+
+	if (!alloc_cpumask_var(&cpus, GFP_ATOMIC))
+		goto default_map;
+
+	cpumask_clear(cpus);
+	nr_cpus = nr_uniq_cpus = 0;
+
+	for_each_cpu(i, online_mask) {
+		nr_cpus++;
+		first_sibling = get_first_sibling(i);
+		if (!cpumask_test_cpu(first_sibling, cpus))
+			nr_uniq_cpus++;
+		cpumask_set_cpu(i, cpus);
+	}
+
+	free_cpumask_var(cpus);
+
+	if (set->nr_hw_queues < nr_uniq_cpus) {
+default_map:
+		set->nr_co_queues = set->nr_hw_queues;
+		set->co_queue_size = 1;
+	} else if (set->nr_hw_queues < nr_cpus) {
+		set->nr_co_queues = nr_uniq_cpus;
+		set->co_queue_size = set->nr_hw_queues / nr_uniq_cpus;
+	} else {
+		set->nr_co_queues = nr_cpus;
+		set->co_queue_size = set->nr_hw_queues / nr_cpus;
+	}
+}
+
 /*
  * We have no quick way of doing reverse lookups. This is only used at
  * queue init time, so runtime isn't important.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 450a3ed..ee05ea9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -21,7 +21,6 @@
 #include <linux/cache.h>
 #include <linux/sched/sysctl.h>
 #include <linux/delay.h>
-#include <linux/crash_dump.h>
 
 #include <trace/events/block.h>
 
@@ -2286,24 +2285,13 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	 * TODO	Restore original queue depth and count limits
 	 */
 
-	/*
-	 * If a crashdump is active, then we are potentially in a very
-	 * memory constrained environment. Limit us to 1 queue.
-	 */
-	set->nr_co_queues = is_kdump_kernel() ? 1 : set->nr_hw_queues;
-	set->co_queue_size = 1;
+	blk_mq_adjust_tag_set(set, cpu_online_mask);
 
 	if (queue_depth(set) < set->reserved_tags + BLK_MQ_TAG_MIN)
 		return -EINVAL;
 	if (queue_depth(set) > BLK_MQ_MAX_DEPTH)
 		return -EINVAL;
 
-	/*
-	 * There is no use for more h/w queues than cpus.
-	 */
-	if (set->nr_co_queues > nr_cpu_ids)
-		set->nr_co_queues = nr_cpu_ids;
-
 	set->tags = kzalloc_node(set->nr_co_queues * sizeof(*set->tags),
 				 GFP_KERNEL, set->numa_node);
 	if (!set->tags)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 592e308..70704f7 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -49,6 +49,8 @@ void blk_mq_disable_hotplug(void);
  */
 extern int blk_mq_update_queue_map(unsigned int *map, unsigned int nr_queues,
 				   const struct cpumask *online_mask);
+extern void blk_mq_adjust_tag_set(struct blk_mq_tag_set *set,
+				  const struct cpumask *online_mask);
 extern int blk_mq_hw_queue_to_node(unsigned int *map, unsigned int);
 
 /*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH 21/21] null_blk: Do not limit # of hardware queues to # of CPUs
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  8:51   ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16  8:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: Alexander Gordeev, Jens Axboe, linux-nvme

It is not the responsibility of the device driver to assume the number
of hardware queues used. Let the block layer decide instead.

CC: Jens Axboe <axboe@kernel.dk>
CC: linux-nvme@lists.infradead.org
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
---
 drivers/block/null_blk.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 1747040..8c5cf88 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -793,9 +793,7 @@ static int __init null_init(void)
 							nr_online_nodes);
 			submit_queues = nr_online_nodes;
 		}
-	} else if (submit_queues > nr_cpu_ids)
-		submit_queues = nr_cpu_ids;
-	else if (!submit_queues)
+	} else if (!submit_queues)
 		submit_queues = 1;
 
 	mutex_init(&lock);
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16  9:27   ` Christoph Hellwig
  -1 siblings, 0 replies; 57+ messages in thread
From: Christoph Hellwig @ 2016-09-16  9:27 UTC (permalink / raw)
  To: Alexander Gordeev; +Cc: linux-kernel, Jens Axboe, linux-nvme

Hi Alex,

this clashes badly with my queue mapping rework that went into
Jens' tree recently.

But in the meantime: there seem to be lots of little bugfixes and
cleanups in the series, any chance you could send them out as a first
series while updating the rest?

Also please Cc the linux-block list for block layer patches that don't
even seem to touch the nvme driver.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-16  9:27   ` Christoph Hellwig
@ 2016-09-16 10:10     ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-16 10:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, Jens Axboe, linux-nvme

On Fri, Sep 16, 2016 at 02:27:43AM -0700, Christoph Hellwig wrote:
> Hi Alex,
> 
> this clashes badly with the my queue mapping rework that went into
> Jens tree recently.

Yeah, I am fully aware the RFC-marked patches would clash with your
work. I will surely rework them if the proposal is considered worthwhile.

> But in the meantime: there seem to be lots of little bugfixes and
> cleanups in the series, any chance you could send them out as a first
> series while updating the rest?

[He-he :) I can even see you removed map_queue() as well]
I will rebase the cleanups on top of your tree.

> Also please Cc the linux-block list for block layer patches that don't
> even seem to touch the nvme driver.

Just wanted to let the NVMe people know, as this h/w is presumably the
main beneficiary.

Thanks!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-16  8:51 ` Alexander Gordeev
@ 2016-09-16 21:04   ` Keith Busch
  -1 siblings, 0 replies; 57+ messages in thread
From: Keith Busch @ 2016-09-16 21:04 UTC (permalink / raw)
  To: Alexander Gordeev; +Cc: linux-kernel, Jens Axboe, linux-nvme

On Fri, Sep 16, 2016 at 10:51:11AM +0200, Alexander Gordeev wrote:
> Linux block device layer limits number of hardware contexts queues
> to number of CPUs in the system. That looks like suboptimal hardware
> utilization in systems where number of CPUs is (significantly) less
> than number of hardware queues.
> 
> In addition, there is a need to deal with tag starvation (see commit
> 0d2602ca "blk-mq: improve support for shared tags maps"). While unused
> hardware queues stay idle, extra efforts are taken to maintain a notion
> of fairness between queue users. Deeper queue depth could probably
> mitigate the whole issue sometimes.
> 
> That all brings a straightforward idea that hardware queues provided by
> a device should be utilized as much as possible.

Hi Alex,

I'm not sure I see how this helps. That probably means I'm not considering
the right scenario. Could you elaborate on when having multiple hardware
queues to choose from a given CPU will provide a benefit?

If we're out of available h/w tags, having more queues shouldn't
improve performance. The tag depth on each nvme hw context is already
deep enough that it should mean even one full queue has saturated the
device capabilities.

Having a 1:1 mapping already seemed like the ideal solution since you
can't simultaneously utilize more than that from the host, so there's
no more h/w parallelism we can exploit. On the controller side, fetching
commands is serialized memory reads, so I don't think spreading IO
among more h/w queues helps the target over posting more commands to a
single queue.

If a CPU has more than one to choose from, a command sent to a less
used queue would be serviced ahead of previously issued commands on a
more heavily used one from the same CPU thread due to how NVMe command
arbitration works, so it sounds like this would create odd latency
outliers.
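
For a made-up illustration of that effect: say queue A already holds a
hundred commands from one CPU and that CPU then posts a single command
to an otherwise empty queue B. With default round-robin arbitration the
controller fetches from the queues in turn, so the lone command on B
gets serviced well ahead of most of the older commands still sitting on
A, even though they were submitted earlier from the same thread.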

Thanks,
Keith

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-16 21:04   ` Keith Busch
@ 2016-09-19 10:38     ` Alexander Gordeev
  -1 siblings, 0 replies; 57+ messages in thread
From: Alexander Gordeev @ 2016-09-19 10:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-kernel, Jens Axboe, linux-nvme, linux-block

On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:

CC-ing linux-block@vger.kernel.org

> I'm not sure I see how this helps. That probably means I'm not considering
> the right scenario. Could you elaborate on when having multiple hardware
> queues to choose from a given CPU will provide a benefit?

No, I do not have any particular scenario in mind besides common
sense. Just an assumption that deeper queues are better (in this RFC,
a virtual combined queue consisting of multiple h/w queues).

Apparently, there could be positive effects only in systems where
# of queues / # of CPUs > 1 or # of queues / # of cores > 1. But
I do not happen to have such a system. If I had numbers, this would
not be an RFC and I probably would not have posted it in the first place ;)

Would it be possible to give it a try on your hardware?
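
For reference, a minimal userspace sketch of the kind of even split I
have in mind, where each CPU gets a contiguous range of hardware queues
(an illustrative formula only, not the code from the series):

#include <stdio.h>

/*
 * Map a CPU to its first queue and queue count, spreading nr_queues
 * evenly over nr_cpus (assumes nr_queues >= nr_cpus).
 */
static void cpu_queue_range(unsigned int cpu, unsigned int nr_cpus,
			    unsigned int nr_queues,
			    unsigned int *first, unsigned int *count)
{
	*first = cpu * nr_queues / nr_cpus;
	*count = (cpu + 1) * nr_queues / nr_cpus - *first;
}

int main(void)
{
	unsigned int nr_cpus = 4, nr_queues = 10, first, count, cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		cpu_queue_range(cpu, nr_cpus, nr_queues, &first, &count);
		printf("cpu %u -> queues [%u..%u]\n",
		       cpu, first, first + count - 1);
	}
	return 0;
}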

> If we're out of available h/w tags, having more queues shouldn't
> improve performance. The tag depth on each nvme hw context is already
> deep enough that it should mean even one full queue has saturated the
> device capabilities.

Am I getting you right - a single full nvme hardware queue stalls
the other queues?

> Having a 1:1 mapping already seemed like the ideal solution since you
> can't simultaneously utilize more than that from the host, so there's
> no more h/w parallelism we can exploit. On the controller side, fetching
> commands is serialized memory reads, so I don't think spreading IO
> among more h/w queues helps the target over posting more commands to a
> single queue.

I take your point about the un-ordered command completion you described
below. But I fail to see why a CPU would not simultaneously utilize more
than one queue by posting to multiple. Is it due to nvme specifics, or
do you assume the host would not issue that many commands?

Besides, blk-mq-tag re-uses the most recently freed tag, so IO should
not actually get spread. Instead, the next available queue is chosen
only if the currently used hardware queue is full. But this is
speculation without real benchmarks, of course.
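
Roughly the selection I mean, as a standalone toy model (not the actual
blk-mq-tag code; the queue structure and depths are invented purely for
illustration):

#include <stddef.h>

struct toy_queue {
	unsigned int in_flight;	/* tags currently in use */
	unsigned int depth;	/* total tags in this queue */
};

/*
 * Stick to the last-used queue; move on to the next mapped queue only
 * when the current one has no free tags left.
 */
static struct toy_queue *pick_queue(struct toy_queue *mapped,
				    unsigned int nr_mapped,
				    unsigned int *last_used)
{
	unsigned int i, idx;

	for (i = 0; i < nr_mapped; i++) {
		idx = (*last_used + i) % nr_mapped;
		if (mapped[idx].in_flight < mapped[idx].depth) {
			*last_used = idx;
			return &mapped[idx];
		}
	}
	return NULL;	/* every mapped queue is full: the caller waits */
}

int main(void)
{
	struct toy_queue mapped[3] = { { 8, 8 }, { 2, 8 }, { 0, 8 } };
	unsigned int last_used = 0;
	struct toy_queue *q = pick_queue(mapped, 3, &last_used);

	/* queue 0 is full, so queue 1 is picked and remembered */
	return (q == &mapped[1] && last_used == 1) ? 0 : 1;
}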

> If a CPU has more than one to choose from, a command sent to a less
> used queue would be serviced ahead of previously issued commands on a
> more heavily used one from the same CPU thread due to how NVMe command
> arbitration works, so it sounds like this would create odd latency
> outliers.

Yep, that sounds scary indeed. Still, any hints on benchmarking
are welcome.

Many thanks!

> Thanks,
> Keith

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-19 10:38     ` Alexander Gordeev
  (?)
@ 2016-09-19 13:33       ` Bart Van Assche
  -1 siblings, 0 replies; 57+ messages in thread
From: Bart Van Assche @ 2016-09-19 13:33 UTC (permalink / raw)
  To: Alexander Gordeev, Keith Busch
  Cc: linux-kernel, Jens Axboe, linux-nvme, linux-block

On 09/19/16 03:38, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
>
> CC-ing linux-block@vger.kernel.org
>
>> I'm not sure I see how this helps. That probably means I'm not considering
>> the right scenario. Could you elaborate on when having multiple hardware
>> queues to choose from a given CPU will provide a benefit?
>
> No, I do not have any particular scenario in mind besides common
> sense. Just an assumption that deeper queues are better (in this RFC,
> a virtual combined queue consisting of multiple h/w queues).
>
> Apparently, there could be positive effects only in systems where
> # of queues / # of CPUs > 1 or # of queues / # of cores > 1. But
> I do not happen to have such a system. If I had numbers, this would
> not be an RFC and I probably would not have posted it in the first place ;)
>
> Would it be possible to give it a try on your hardware?

Hello Alexander,

It is your task to measure the performance impact of these patches and
not Keith's task. BTW, I'm not convinced that multiple hardware queues
per CPU will result in a performance improvement. I have not yet seen
any SSD for which a queue depth above 512 results in better performance
than a queue depth equal to 512. Which applications do you think will
generate and sustain a queue depth above 512? Additionally, my
experience from another high-performance context (RDMA) is that reducing
the number of queues can result in higher IOPS due to fewer interrupts
per I/O.

Bart.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues
  2016-09-19 10:38     ` Alexander Gordeev
@ 2016-09-20 15:00       ` Keith Busch
  -1 siblings, 0 replies; 57+ messages in thread
From: Keith Busch @ 2016-09-20 15:00 UTC (permalink / raw)
  To: Alexander Gordeev; +Cc: linux-kernel, Jens Axboe, linux-nvme, linux-block

On Mon, Sep 19, 2016 at 12:38:05PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 16, 2016 at 05:04:48PM -0400, Keith Busch wrote:
> 
> > Having a 1:1 mapping already seemed like the ideal solution since you
> > can't simultaneously utilize more than that from the host, so there's
> > no more h/w parallelism we can exploit. On the controller side, fetching
> > commands is serialized memory reads, so I don't think spreading IO
> > among more h/w queues helps the target over posting more commands to a
> > single queue.
> 
> I take your point about the un-ordered command completion you described
> below. But I fail to see why a CPU would not simultaneously utilize more
> than one queue by posting to multiple. Is it due to nvme specifics, or
> do you assume the host would not issue that many commands?

What I mean is that if you have N CPUs, you can't possibly simultaneously
write more than N submission queue entries. The benefit of having 1:1
for the queue <-> CPU mapping is that each CPU can post a command to
its queue without lock contention at the same time as another thread.
Having more to choose from doesn't let the host post commands any faster
than we can today.

When we're out of tags, the request currently just waits for one to
become available, increasing submission latency. You can fix that by
increasing the available tags with deeper or more h/w queues, but that
just increases completion latency since the device can't process them
any faster. It's six of one, half dozen of the other.
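
(A rough Little's-law framing of the same point, with numbers invented
purely for illustration: in-flight = IOPS x latency, so a device that
tops out at 500K IOPS with 200 usec of internal latency is kept busy by
about 100 outstanding commands; any tags beyond that just sit in a
queue and show up as added completion latency.)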

The depth per queue defaults to 1k. If your process really is able to use
all those resources, the hardware is completely saturated and you're not
going to benefit from introducing more tags [1]. It could conceivably
make things worse by reducing cache hits, or trigger inappropriate
timeout handling due to the increased completion latency.

 [1] http://lists.infradead.org/pipermail/linux-nvme/2014-July/001064.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2016-09-20 15:00 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-16  8:51 [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues Alexander Gordeev
2016-09-16  8:51 ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 01/21] blk-mq: Fix memory leaks on a queue cleanup Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 02/21] blk-mq: Fix a potential NULL pointer assignment to hctx tags Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 03/21] block: Get rid of unused request_queue::nr_queues member Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 04/21] blk-mq: Do not limit number of queues to 'nr_cpu_ids' in allocations Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 05/21] blk-mq: Update hardware queue map after q->nr_hw_queues is set Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 06/21] block: Remove redundant blk_mq_ops::map_queue() interface Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 07/21] blk-mq: Remove a redundant assignment Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 08/21] blk-mq: Cleanup hardware context data node selection Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 09/21] blk-mq: Cleanup a loop exit condition Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 10/21] blk-mq: Get rid of unnecessary blk_mq_free_hw_queues() Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 11/21] blk-mq: Move duplicating code to blk_mq_exit_hctx() Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 12/21] blk-mq: Uninit hardware context in order reverse to init Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 13/21] blk-mq: Move hardware context init code into blk_mq_init_hctx() Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 14/21] blk-mq: Rework blk_mq_init_hctx() function Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 15/21] blk-mq: Pair blk_mq_hctx_kobj_init() with blk_mq_hctx_kobj_put() Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 16/21] blk-mq: Set flush_start_tag to BLK_MQ_MAX_DEPTH Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH RFC 17/21] blk-mq: Introduce a 1:N hardware contexts Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH RFC 18/21] blk-mq: Enable tag numbers exceed hardware queue depth Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH RFC 19/21] blk-mq: Enable combined hardware queues Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH RFC 20/21] blk-mq: Allow " Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  8:51 ` [PATCH 21/21] null_blk: Do not limit # of hardware queues to # of CPUs Alexander Gordeev
2016-09-16  8:51   ` Alexander Gordeev
2016-09-16  9:27 ` [PATCH RFC 00/21] blk-mq: Introduce combined hardware queues Christoph Hellwig
2016-09-16  9:27   ` Christoph Hellwig
2016-09-16 10:10   ` Alexander Gordeev
2016-09-16 10:10     ` Alexander Gordeev
2016-09-16 21:04 ` Keith Busch
2016-09-16 21:04   ` Keith Busch
2016-09-19 10:38   ` Alexander Gordeev
2016-09-19 10:38     ` Alexander Gordeev
2016-09-19 13:33     ` Bart Van Assche
2016-09-19 13:33       ` Bart Van Assche
2016-09-19 13:33       ` Bart Van Assche
2016-09-20 15:00     ` Keith Busch
2016-09-20 15:00       ` Keith Busch
