[PATCH 1/3] blk-mq: add an API to estimate hardware queue node

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
@ 2016-03-25 21:36 Shaohua Li
  2016-03-25 21:36 ` [PATCH 2/3] blk-mq: allocate blk_mq_tags and requests in correct node Shaohua Li
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Shaohua Li @ 2016-03-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: axboe, hch, Kernel-team

we allocate most data structure in device's node, but some data
structures are not for DMA and mostly used by specific cpus/node which
could diff from device's node. Allocating such hot data in device's
node doesn't make sense. Add an API to estimate hardware queue node.
This can be used before blk-mq actually establishes the mapping. This
API runs slow, but it only used in initialization time.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c         | 21 +++++++++++++++++++++
 include/linux/blk-mq.h |  2 ++
 2 files changed, 23 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 050f7a1..ec214d3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2330,6 +2330,27 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
+int blk_mq_estimate_hw_queue_node(unsigned int total_queues,
+	unsigned int index)
+{
+	unsigned int *map;
+	int node;
+
+	if (total_queues == 1)
+		return NUMA_NO_NODE;
+	map = kzalloc(sizeof(*map) * nr_cpu_ids, GFP_KERNEL);
+	if (!map)
+		return NUMA_NO_NODE;
+	if (blk_mq_update_queue_map(map, total_queues, cpu_online_mask)) {
+		kfree(map);
+		return NUMA_NO_NODE;
+	}
+	node = blk_mq_hw_queue_to_node(map, index);
+	kfree(map);
+	return node;
+}
+EXPORT_SYMBOL(blk_mq_estimate_hw_queue_node);
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 15a73d4..892b41b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -188,6 +188,8 @@ void blk_mq_insert_request(struct request *, bool, bool, bool);
 void blk_mq_free_request(struct request *rq);
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *, struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
+int blk_mq_estimate_hw_queue_node(unsigned int total_queues,
+	unsigned int index);
 
 enum {
 	BLK_MQ_REQ_NOWAIT	= (1 << 0), /* return when out of requests */
-- 
2.8.0.rc2

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/3] blk-mq: allocate blk_mq_tags and requests in correct node
  2016-03-25 21:36 [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Shaohua Li
@ 2016-03-25 21:36 ` Shaohua Li
  2016-03-25 21:36 ` [PATCH 3/3] nvme: allocate nvme_queue " Shaohua Li
  2016-03-29  7:24 ` [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Christoph Hellwig
  2 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2016-03-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: axboe, hch, Kernel-team

blk_mq_tags/requests of specific hardware queue are mostly used in
specific cpus, which might not be in the same numa node as disk. For
example, a nvme card is in node 0. half hardware queue will be used by
node 0, the other node 1.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-mq.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec214d3..2e37c28 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1465,9 +1465,15 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 	struct blk_mq_tags *tags;
 	unsigned int i, j, entries_per_page, max_order = 4;
 	size_t rq_size, left;
+	int node;
+
+	node = blk_mq_estimate_hw_queue_node(set->nr_hw_queues,
+		hctx_idx);
+	if (node == NUMA_NO_NODE)
+		node = set->numa_node;
 
 	tags = blk_mq_init_tags(set->queue_depth, set->reserved_tags,
-				set->numa_node,
+				node,
 				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
 	if (!tags)
 		return NULL;
@@ -1476,7 +1482,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 
 	tags->rqs = kzalloc_node(set->queue_depth * sizeof(struct request *),
 				 GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY,
-				 set->numa_node);
+				 node);
 	if (!tags->rqs) {
 		blk_mq_free_tags(tags);
 		return NULL;
@@ -1500,7 +1506,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 			this_order--;
 
 		do {
-			page = alloc_pages_node(set->numa_node,
+			page = alloc_pages_node(node,
 				GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO,
 				this_order);
 			if (page)
@@ -1531,7 +1537,7 @@ static struct blk_mq_tags *blk_mq_init_rq_map(struct blk_mq_tag_set *set,
 			if (set->ops->init_request) {
 				if (set->ops->init_request(set->driver_data,
 						tags->rqs[i], hctx_idx, i,
-						set->numa_node)) {
+						node)) {
 					tags->rqs[i] = NULL;
 					goto fail;
 				}
-- 
2.8.0.rc2

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/3] nvme: allocate nvme_queue in correct node
  2016-03-25 21:36 [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Shaohua Li
  2016-03-25 21:36 ` [PATCH 2/3] blk-mq: allocate blk_mq_tags and requests in correct node Shaohua Li
@ 2016-03-25 21:36 ` Shaohua Li
  2016-03-29  7:24 ` [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Christoph Hellwig
  2 siblings, 0 replies; 10+ messages in thread
From: Shaohua Li @ 2016-03-25 21:36 UTC (permalink / raw)
  To: linux-block, linux-kernel; +Cc: axboe, hch, Kernel-team

nvme_queue is per-cpu queue (mostly). Allocating it in node where blk-mq
will use it.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/nvme/host/pci.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f8db70a..8d24701 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1138,9 +1138,9 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
 }
 
 static struct nvme_queue *nvme_alloc_queue(struct nvme_dev *dev, int qid,
-							int depth)
+							int depth, int node)
 {
-	struct nvme_queue *nvmeq = kzalloc(sizeof(*nvmeq), GFP_KERNEL);
+	struct nvme_queue *nvmeq = kzalloc_node(sizeof(*nvmeq), GFP_KERNEL, node);
 	if (!nvmeq)
 		return NULL;
 
@@ -1318,7 +1318,8 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
 
 	nvmeq = dev->queues[0];
 	if (!nvmeq) {
-		nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH);
+		nvmeq = nvme_alloc_queue(dev, 0, NVME_AQ_DEPTH,
+			dev_to_node(dev->dev));
 		if (!nvmeq)
 			return -ENOMEM;
 	}
@@ -1372,11 +1373,15 @@ static void nvme_watchdog_timer(unsigned long data)
 
 static int nvme_create_io_queues(struct nvme_dev *dev)
 {
-	unsigned i, max;
+	unsigned i, max, start;
 	int ret = 0;
+	int node;
 
+	start = dev->queue_count;
 	for (i = dev->queue_count; i <= dev->max_qid; i++) {
-		if (!nvme_alloc_queue(dev, i, dev->q_depth)) {
+		node = blk_mq_estimate_hw_queue_node(
+			dev->max_qid - start + 1, i - start);
+		if (!nvme_alloc_queue(dev, i, dev->q_depth, node)) {
 			ret = -ENOMEM;
 			break;
 		}
-- 
2.8.0.rc2

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
  2016-03-25 21:36 [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Shaohua Li
  2016-03-25 21:36 ` [PATCH 2/3] blk-mq: allocate blk_mq_tags and requests in correct node Shaohua Li
  2016-03-25 21:36 ` [PATCH 3/3] nvme: allocate nvme_queue " Shaohua Li
@ 2016-03-29  7:24 ` Christoph Hellwig
  2016-03-29 16:47   ` Shaohua Li
  2 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2016-03-29  7:24 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-block, linux-kernel, axboe, hch, Kernel-team

On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
> we allocate most data structure in device's node, but some data
> structures are not for DMA and mostly used by specific cpus/node which
> could diff from device's node. Allocating such hot data in device's
> node doesn't make sense. Add an API to estimate hardware queue node.
> This can be used before blk-mq actually establishes the mapping. This
> API runs slow, but it only used in initialization time.

I think this is the wrong way around.  I've got some proprotype code
that just leaves the cpu assignments to the drivers and picks it up
in blk-mq.  Give me a few days to post it..

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
  2016-03-29  7:24 ` [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Christoph Hellwig
@ 2016-03-29 16:47   ` Shaohua Li
  2016-03-29 16:50     ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Shaohua Li @ 2016-03-29 16:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-block, linux-kernel, axboe, Kernel-team

On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote:
> On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
> > we allocate most data structure in device's node, but some data
> > structures are not for DMA and mostly used by specific cpus/node which
> > could diff from device's node. Allocating such hot data in device's
> > node doesn't make sense. Add an API to estimate hardware queue node.
> > This can be used before blk-mq actually establishes the mapping. This
> > API runs slow, but it only used in initialization time.
> 
> I think this is the wrong way around.  I've got some proprotype code
> that just leaves the cpu assignments to the drivers and picks it up
> in blk-mq.  Give me a few days to post it..

This looks weird, shouldn't the cpu assignment be determined by block
core (blk-mq) because block core decides how to use the queue?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
  2016-03-29 16:47   ` Shaohua Li
@ 2016-03-29 16:50     ` Jens Axboe
  2016-03-29 17:44       ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2016-03-29 16:50 UTC (permalink / raw)
  To: Shaohua Li, Christoph Hellwig; +Cc: linux-block, linux-kernel, Kernel-team

On 03/29/2016 10:47 AM, Shaohua Li wrote:
> On Tue, Mar 29, 2016 at 12:24:43AM -0700, Christoph Hellwig wrote:
>> On Fri, Mar 25, 2016 at 02:36:30PM -0700, Shaohua Li wrote:
>>> we allocate most data structure in device's node, but some data
>>> structures are not for DMA and mostly used by specific cpus/node which
>>> could diff from device's node. Allocating such hot data in device's
>>> node doesn't make sense. Add an API to estimate hardware queue node.
>>> This can be used before blk-mq actually establishes the mapping. This
>>> API runs slow, but it only used in initialization time.
>>
>> I think this is the wrong way around.  I've got some proprotype code
>> that just leaves the cpu assignments to the drivers and picks it up
>> in blk-mq.  Give me a few days to post it..
>
> This looks weird, shouldn't the cpu assignment be determined by block
> core (blk-mq) because block core decides how to use the queue?

I agree, that belongs in the blk-mq proper, the driver should just 
follow the rules outlined, not impose their own in this regard. It'll 
also help with irq affinity mappings, once we get that in.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
  2016-03-29 16:50     ` Jens Axboe
@ 2016-03-29 17:44       ` Christoph Hellwig
  2016-03-29 20:51         ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2016-03-29 17:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Shaohua Li, Christoph Hellwig, linux-block, linux-kernel, Kernel-team

On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote:
> >This looks weird, shouldn't the cpu assignment be determined by block
> >core (blk-mq) because block core decides how to use the queue?
> 
> I agree, that belongs in the blk-mq proper, the driver should just follow
> the rules outlined, not impose their own in this regard. It'll also help
> with irq affinity mappings, once we get that in.

It's not going to work that way unfortunately.  Lots of driver simply
have no control over the underlying interrupts.  Think of any RDMA
storage or other layer drivers - they get low level queues from a layer
they don't control and need a block queue for each of them.

My plan is to make the block layer follow what the networking layer
does - get the low level queues / MSI-X pairs and then use the
infrastructure in lib/cpu_rmap.c to figure out the number of queues
and queue placement for them.

The lower half of that is about to get rewritten by Thomas after
disccussion between him, me and a few others to provide drivers
nice APIs for spreading MSI-X vectors over CPUs or nodes.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
  2016-03-29 17:44       ` Christoph Hellwig
@ 2016-03-29 20:51         ` Jens Axboe
  2016-03-29 21:20           ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Jens Axboe @ 2016-03-29 20:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Shaohua Li, linux-block, linux-kernel, Kernel-team

On 03/29/2016 11:44 AM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2016 at 10:50:11AM -0600, Jens Axboe wrote:
>>> This looks weird, shouldn't the cpu assignment be determined by block
>>> core (blk-mq) because block core decides how to use the queue?
>>
>> I agree, that belongs in the blk-mq proper, the driver should just follow
>> the rules outlined, not impose their own in this regard. It'll also help
>> with irq affinity mappings, once we get that in.
>
> It's not going to work that way unfortunately.  Lots of driver simply
> have no control over the underlying interrupts.  Think of any RDMA
> storage or other layer drivers - they get low level queues from a layer
> they don't control and need a block queue for each of them.
>
> My plan is to make the block layer follow what the networking layer
> does - get the low level queues / MSI-X pairs and then use the
> infrastructure in lib/cpu_rmap.c to figure out the number of queues
> and queue placement for them.

That sounds fine, as long as we get it in general infrastructure. Note 
that we need to be able to support sets of queues for blk-mq, once we 
start splitting them up (for scheduling). As long as we can map sets of 
queues to sets of CPUs, then that's not a concern.

For Shaohua's patches, not binding per-device data to the home node 
makes a lot of sense, though.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
  2016-03-29 20:51         ` Jens Axboe
@ 2016-03-29 21:20           ` Christoph Hellwig
  2016-03-29 21:22             ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2016-03-29 21:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Shaohua Li, linux-block, linux-kernel, Kernel-team

On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote:
> For Shaohua's patches, not binding per-device data to the home node makes a
> lot of sense, though.

It does.  But exposing the node estimation function just to work around
the fact that the way the cpu/node mapping is done is upside down
currently doesn't seem helpful.  Let's try to revamp that first and the
rebase the rest of the series on top of that.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/3] blk-mq: add an API to estimate hardware queue node
  2016-03-29 21:20           ` Christoph Hellwig
@ 2016-03-29 21:22             ` Jens Axboe
  0 siblings, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2016-03-29 21:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Shaohua Li, linux-block, linux-kernel, Kernel-team

On 03/29/2016 03:20 PM, Christoph Hellwig wrote:
> On Tue, Mar 29, 2016 at 02:51:37PM -0600, Jens Axboe wrote:
>> For Shaohua's patches, not binding per-device data to the home node makes a
>> lot of sense, though.
>
> It does.  But exposing the node estimation function just to work around
> the fact that the way the cpu/node mapping is done is upside down
> currently doesn't seem helpful.  Let's try to revamp that first and the
> rebase the rest of the series on top of that.

That's fine with me, it's not exactly urgent.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2016-03-29 21:27 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-25 21:36 [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Shaohua Li
2016-03-25 21:36 ` [PATCH 2/3] blk-mq: allocate blk_mq_tags and requests in correct node Shaohua Li
2016-03-25 21:36 ` [PATCH 3/3] nvme: allocate nvme_queue " Shaohua Li
2016-03-29  7:24 ` [PATCH 1/3] blk-mq: add an API to estimate hardware queue node Christoph Hellwig
2016-03-29 16:47   ` Shaohua Li
2016-03-29 16:50     ` Jens Axboe
2016-03-29 17:44       ` Christoph Hellwig
2016-03-29 20:51         ` Jens Axboe
2016-03-29 21:20           ` Christoph Hellwig
2016-03-29 21:22             ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.