* [PATCH 0/4] blk-mq: queue mapping fix & improvement
@ 2018-12-16  2:25 Ming Lei
  2018-12-16  2:25 ` [PATCH 1/4] nvme-pci: correct mapping for poll queue(s) Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-16  2:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Jeff Moyer, Mike Snitzer, Christoph Hellwig

Hi,

The 1st patch fixes the queue mapping when there are no poll/write queues;
it turns out that IO hangs can be caused by this bad queue mapping.

The 2nd patch fixes allocation for queue mapping table.

The 3rd patch improves blk_mq_map_swqueue() a bit and makes it more
robust when dealing with shared queue mappings.

The 4th patch exports hctx->type in debugfs, so that we can write a
debugfs-based test to verify that the queue mapping is valid.
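
For reference, the structures involved look roughly like this in for-4.21
(a sketch quoted from memory; see include/linux/blk-mq.h for the
authoritative definitions):

struct blk_mq_queue_map {
	unsigned int *mq_map;		/* per-CPU table: cpu -> hw queue index */
	unsigned int nr_queues;		/* number of hw queues in this map */
	unsigned int queue_offset;	/* first hctx covered by this map */
};

enum hctx_type {
	HCTX_TYPE_DEFAULT,		/* all I/O not otherwise accounted for */
	HCTX_TYPE_READ,			/* just for READ I/O */
	HCTX_TYPE_POLL,			/* polled I/O of any kind */

	HCTX_MAX_TYPES,
};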

Ming Lei (4):
  nvme-pci: correct mapping for poll queue(s)
  blk-mq: fix allocation for queue mapping table
  blk-mq: deal with shared queue mapping reliably
  blk-mq-debugfs: export hctx->type

 block/blk-mq-debugfs.c  |  9 +++++++++
 block/blk-mq.c          |  5 ++++-
 drivers/nvme/host/pci.c | 36 ++++++++++++++++++++----------------
 3 files changed, 33 insertions(+), 17 deletions(-)

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>

-- 
2.9.5


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/4] nvme-pci: correct mapping for poll queue(s)
  2018-12-16  2:25 [PATCH 0/4] blk-mq: queue mapping fix & improvement Ming Lei
@ 2018-12-16  2:25 ` Ming Lei
  2018-12-16 16:11   ` Christoph Hellwig
  2018-12-16  2:25 ` [PATCH 2/4] blk-mq: fix allocation for queue mapping table Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-16  2:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Jeff Moyer, Mike Snitzer, Christoph Hellwig

If we don't have poll queues, their mapping should share the default
type's mapping instead of setting up a new one via blk_mq_map_queues().

blk_mq_map_swqueue() is fragile when dealing with such a shared mapping:
the same ctx can end up mapped to two hctxs of the same queue type, which
can easily cause IO hangs.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/pci.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fb9d8270f32c..95bd68be2078 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -505,10 +505,11 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 
 		/*
 		 * The poll queue(s) doesn't have an IRQ (and hence IRQ
-		 * affinity), so use the regular blk-mq cpu mapping
+		 * affinity), so use the regular blk-mq cpu mapping if
+		 * poll queue(s) don't share mapping with TYPE_DEFAULT.
 		 */
 		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL)
+		if (i != HCTX_TYPE_POLL || !qoff)
 			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
 		else
 			blk_mq_map_queues(map);
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/4] blk-mq: fix allocation for queue mapping table
  2018-12-16  2:25 [PATCH 0/4] blk-mq: queue mapping fix & improvement Ming Lei
  2018-12-16  2:25 ` [PATCH 1/4] nvme-pci: correct mapping for poll queue(s) Ming Lei
@ 2018-12-16  2:25 ` Ming Lei
  2018-12-16 16:15   ` Christoph Hellwig
  2018-12-16  2:25 ` [PATCH 3/4] blk-mq: deal with shared queue mapping reliably Ming Lei
  2018-12-16  2:25 ` [PATCH 4/4] blk-mq-debugfs: export hctx->type Ming Lei
  3 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-16  2:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Jeff Moyer, Mike Snitzer, Christoph Hellwig

The type of each element in the queue mapping table is 'unsigned int',
not 'struct blk_mq_queue_map', so fix the allocation to use the correct
element size.
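
The old size only over-allocates (each per-CPU slot holds a single queue
index), so this wastes memory rather than corrupting it. A quick
userspace sketch of the mismatch, with the struct layout copied in just
to keep it self-contained:

#include <stdio.h>

/* layout as in the for-4.21 tree, copied here for illustration only */
struct blk_mq_queue_map {
	unsigned int *mq_map;
	unsigned int nr_queues;
	unsigned int queue_offset;
};

int main(void)
{
	printf("per-element size allocated before the fix: %zu bytes\n",
	       sizeof(struct blk_mq_queue_map));
	printf("per-element size actually needed:          %zu bytes\n",
	       sizeof(unsigned int));
	return 0;
}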

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9690f4f8de7e..a4a0895dae65 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3023,7 +3023,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	ret = -ENOMEM;
 	for (i = 0; i < set->nr_maps; i++) {
 		set->map[i].mq_map = kcalloc_node(nr_cpu_ids,
-						  sizeof(struct blk_mq_queue_map),
+						  sizeof(set->map[i].mq_map[0]),
 						  GFP_KERNEL, set->numa_node);
 		if (!set->map[i].mq_map)
 			goto out_free_mq_map;
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-16  2:25 [PATCH 0/4] blk-mq: queue mapping fix & improvement Ming Lei
  2018-12-16  2:25 ` [PATCH 1/4] nvme-pci: correct mapping for poll queue(s) Ming Lei
  2018-12-16  2:25 ` [PATCH 2/4] blk-mq: fix allocation for queue mapping table Ming Lei
@ 2018-12-16  2:25 ` Ming Lei
  2018-12-16 16:16   ` Christoph Hellwig
  2018-12-16  2:25 ` [PATCH 4/4] blk-mq-debugfs: export hctx->type Ming Lei
  3 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-16  2:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Jeff Moyer, Mike Snitzer, Christoph Hellwig

This patch sets map->nr_queues to zero explicitly if there are zero
queues for a given queue type, so that blk_mq_map_swqueue() can deal
with shared mappings more robustly.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c          |  3 +++
 drivers/nvme/host/pci.c | 37 ++++++++++++++++++++-----------------
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a4a0895dae65..a737d912c46b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2435,6 +2435,9 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 		for (j = 0; j < set->nr_maps; j++) {
 			hctx = blk_mq_map_queue_type(q, j, i);
 
+			if (!set->map[j].nr_queues)
+				continue;
+
 			/*
 			 * If the CPU is already set in the mask, then we've
 			 * mapped this one already. This can happen if
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 95bd68be2078..43074c54279e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -492,29 +492,32 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 	offset = queue_irq_offset(dev);
 	for (i = 0, qoff = 0; i < set->nr_maps; i++) {
 		struct blk_mq_queue_map *map = &set->map[i];
+		unsigned nr_queues;
 
-		map->nr_queues = dev->io_queues[i];
-		if (!map->nr_queues) {
+		nr_queues = map->nr_queues = dev->io_queues[i];
+		if (!nr_queues) {
 			BUG_ON(i == HCTX_TYPE_DEFAULT);
 
-			/* shared set, resuse read set parameters */
-			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
+			/* shared set, reuse default set parameters and table */
+			nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
 			qoff = 0;
 			offset = queue_irq_offset(dev);
-		}
 
-		/*
-		 * The poll queue(s) doesn't have an IRQ (and hence IRQ
-		 * affinity), so use the regular blk-mq cpu mapping if
-		 * poll queue(s) don't share mapping with TYPE_DEFAULT.
-		 */
-		map->queue_offset = qoff;
-		if (i != HCTX_TYPE_POLL || !qoff)
-			blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
-		else
-			blk_mq_map_queues(map);
-		qoff += map->nr_queues;
-		offset += map->nr_queues;
+			memcpy(map->mq_map, set->map[HCTX_TYPE_DEFAULT].mq_map,
+				   nr_cpu_ids * sizeof(map->mq_map[0]));
+		} else {
+			/*
+			 * The poll queue(s) doesn't have an IRQ (and hence IRQ
+			 * affinity), so use the regular blk-mq cpu mapping.
+			 */
+			map->queue_offset = qoff;
+			if (i != HCTX_TYPE_POLL)
+				blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
+			else
+				blk_mq_map_queues(map);
+		}
+		qoff += nr_queues;
+		offset += nr_queues;
 	}
 
 	return 0;
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/4] blk-mq-debugfs: export hctx->type
  2018-12-16  2:25 [PATCH 0/4] blk-mq: queue mapping fix & improvement Ming Lei
                   ` (2 preceding siblings ...)
  2018-12-16  2:25 ` [PATCH 3/4] blk-mq: deal with shared queue mapping reliably Ming Lei
@ 2018-12-16  2:25 ` Ming Lei
  2018-12-16 16:17   ` Christoph Hellwig
  3 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-16  2:25 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Ming Lei, Jeff Moyer, Mike Snitzer, Christoph Hellwig

Now we only export hctx->type via sysfs, and there is no such info in
the hctx entry under debugfs. We often use debugfs alone to diagnose
queue mapping issues, so add the support in debugfs.

Queue mapping has become a bit more complicated now that multiple queue
maps are supported; we may write a blktests case that verifies the queue
mapping based on blk-mq debugfs.
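
With this in place, a test can simply read the per-hctx attribute; a
minimal userspace sketch, assuming the usual
/sys/kernel/debug/block/<disk>/hctx<N>/ layout used by blk-mq-debugfs:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256];
	int i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <disk>\n", argv[0]);
		return 1;
	}

	/* walk hctx0, hctx1, ... until the debugfs entry no longer exists */
	for (i = 0; ; i++) {
		char buf[32];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/kernel/debug/block/%s/hctx%d/type", argv[1], i);
		f = fopen(path, "r");
		if (!f)
			break;
		if (fgets(buf, sizeof(buf), f))
			printf("hctx%d: %s", i, buf);
		fclose(f);
	}
	return 0;
}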

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-debugfs.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a32bb79d6c95..219e29d49a49 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -636,6 +636,14 @@ static int hctx_dispatch_busy_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int hctx_type_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+
+	seq_printf(m, "%u\n", hctx->type);
+	return 0;
+}
+
 static void *ctx_rq_list_start(struct seq_file *m, loff_t *pos)
 	__acquires(&ctx->lock)
 {
@@ -798,6 +806,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{"run", 0600, hctx_run_show, hctx_run_write},
 	{"active", 0400, hctx_active_show},
 	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
+	{"type", 0400, hctx_type_show},
 	{},
 };
 
-- 
2.9.5


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] nvme-pci: correct mapping for poll queue(s)
  2018-12-16  2:25 ` [PATCH 1/4] nvme-pci: correct mapping for poll queue(s) Ming Lei
@ 2018-12-16 16:11   ` Christoph Hellwig
  2018-12-17  1:00     ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2018-12-16 16:11 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer, Christoph Hellwig

On Sun, Dec 16, 2018 at 10:25:14AM +0800, Ming Lei wrote:
> If we don't have poll queues, their mapping should share the default
> type's mapping instead of setting up a new one via blk_mq_map_queues().

No, if we don't have poll queues nr_maps should be 2.  I've already
posted a patch for that for nvme-pci which got it wrong (my fault..)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/4] blk-mq: fix allocation for queue mapping table
  2018-12-16  2:25 ` [PATCH 2/4] blk-mq: fix allocation for queue mapping table Ming Lei
@ 2018-12-16 16:15   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2018-12-16 16:15 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer, Christoph Hellwig

On Sun, Dec 16, 2018 at 10:25:15AM +0800, Ming Lei wrote:
> The type of each element in the queue mapping table is 'unsigned int',
> not 'struct blk_mq_queue_map', so fix the allocation to use the correct
> element size.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-16  2:25 ` [PATCH 3/4] blk-mq: deal with shared queue mapping reliably Ming Lei
@ 2018-12-16 16:16   ` Christoph Hellwig
  2018-12-16 18:39     ` Mike Snitzer
  2018-12-17  1:27     ` Ming Lei
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Hellwig @ 2018-12-16 16:16 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer, Christoph Hellwig

On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> This patch sets map->nr_queues to zero explicitly if there are zero
> queues for a given queue type, so that blk_mq_map_swqueue() can deal
> with shared mappings more robustly.

This looks a lot more clumsy than what we had before, can you explain
what additional robustness it buys us?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] blk-mq-debugfs: export hctx->type
  2018-12-16  2:25 ` [PATCH 4/4] blk-mq-debugfs: export hctx->type Ming Lei
@ 2018-12-16 16:17   ` Christoph Hellwig
  2018-12-16 19:12     ` Jens Axboe
  2018-12-17  1:06     ` Ming Lei
  0 siblings, 2 replies; 20+ messages in thread
From: Christoph Hellwig @ 2018-12-16 16:17 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer, Christoph Hellwig

On Sun, Dec 16, 2018 at 10:25:17AM +0800, Ming Lei wrote:
> Now we only export hctx->type via sysfs, and there is no such info in
> the hctx entry under debugfs. We often use debugfs alone to diagnose
> queue mapping issues, so add the support in debugfs.
>
> Queue mapping has become a bit more complicated now that multiple queue
> maps are supported; we may write a blktests case that verifies the queue
> mapping based on blk-mq debugfs.

I'll let Jens decide if we really want to double export information
like this (or maybe even move it to debugfs only?)

> +static int hctx_type_show(void *data, struct seq_file *m)
> +{
> +	struct blk_mq_hw_ctx *hctx = data;
> +
> +	seq_printf(m, "%u\n", hctx->type);
> +	return 0;

But if we export it we should probably export it as a text, just
like we do for sysfs now.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-16 16:16   ` Christoph Hellwig
@ 2018-12-16 18:39     ` Mike Snitzer
  2018-12-16 19:12       ` Jens Axboe
  2018-12-17  1:27     ` Ming Lei
  1 sibling, 1 reply; 20+ messages in thread
From: Mike Snitzer @ 2018-12-16 18:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Ming Lei, Jens Axboe, linux-block, Jeff Moyer

On Sun, Dec 16 2018 at 11:16am -0500,
Christoph Hellwig <hch@lst.de> wrote:

> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues to zero explicitly if there are zero
> > queues for a given queue type, so that blk_mq_map_swqueue() can deal
> > with shared mappings more robustly.
> 
> This looks a lot more clumsy than what we had before, can you explain
> what additional robustness it buys us?

It enables nvme IO to complete on my testbed with for-4.21/block
changes; this NUMA layout is what triggered Ming's work:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14
node 0 size: 128605 MB
node 0 free: 128092 MB
node 1 cpus: 1 3 5 7 9 11 13 15
node 1 size: 128997 MB
node 1 free: 128529 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Without the aggregate changes from this patchset (1-3 anyway) I get IO
hangs in blkdev_fsync().

With that in mind, Jens needs these fixes (or something comparable)
ASAP.

Tested-by: Mike Snitzer <snitzer@redhat.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-16 18:39     ` Mike Snitzer
@ 2018-12-16 19:12       ` Jens Axboe
  2018-12-17  1:04         ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Jens Axboe @ 2018-12-16 19:12 UTC (permalink / raw)
  To: Mike Snitzer, Christoph Hellwig; +Cc: Ming Lei, linux-block, Jeff Moyer

On 12/16/18 11:39 AM, Mike Snitzer wrote:
> On Sun, Dec 16 2018 at 11:16am -0500,
> Christoph Hellwig <hch@lst.de> wrote:
> 
>> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
>>> This patch sets map->nr_queues to zero explicitly if there are zero
>>> queues for a given queue type, so that blk_mq_map_swqueue() can deal
>>> with shared mappings more robustly.
>>
>> This looks a lot more clumsy than what we had before, can you explain
>> what additional robustness it buys us?
> 
> It enables nvme IO to complete on my testbed with for-4.21/block
> changes; this NUMA layout is what triggered Ming's work:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14
> node 0 size: 128605 MB
> node 0 free: 128092 MB
> node 1 cpus: 1 3 5 7 9 11 13 15
> node 1 size: 128997 MB
> node 1 free: 128529 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> Without the aggregate changes from this patchset (1-3 anyway) I get IO
> hangs in blkdev_fsync().

Still puzzled. I'm not against the change, but the commit message has
NOTHING in the way of justification. "Make X more robust" doesn't
mean anything. Your followup brings no extra info to the table in
terms of what the bug is here. What exact bug is it fixing? Why is
fsync currently hanging?


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] blk-mq-debugfs: export hctx->type
  2018-12-16 16:17   ` Christoph Hellwig
@ 2018-12-16 19:12     ` Jens Axboe
  2018-12-17  1:06     ` Ming Lei
  1 sibling, 0 replies; 20+ messages in thread
From: Jens Axboe @ 2018-12-16 19:12 UTC (permalink / raw)
  To: Christoph Hellwig, Ming Lei; +Cc: linux-block, Jeff Moyer, Mike Snitzer

On 12/16/18 9:17 AM, Christoph Hellwig wrote:
> On Sun, Dec 16, 2018 at 10:25:17AM +0800, Ming Lei wrote:
>> Now we only export hctx->type via sysfs, and there is no such info in
>> the hctx entry under debugfs. We often use debugfs alone to diagnose
>> queue mapping issues, so add the support in debugfs.
>>
>> Queue mapping has become a bit more complicated now that multiple queue
>> maps are supported; we may write a blktests case that verifies the queue
>> mapping based on blk-mq debugfs.
> 
> I'll let Jens decide if we really want to double export information
> like this (or maybe even move it to debugfs only?)

Let's just move it to debugfs, we don't need it twice.

>> +static int hctx_type_show(void *data, struct seq_file *m)
>> +{
>> +	struct blk_mq_hw_ctx *hctx = data;
>> +
>> +	seq_printf(m, "%u\n", hctx->type);
>> +	return 0;
> 
> But if we export it we should probably export it as a text, just
> like we do for sysfs now.

Indeed

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] nvme-pci: correct mapping for poll queue(s)
  2018-12-16 16:11   ` Christoph Hellwig
@ 2018-12-17  1:00     ` Ming Lei
  2018-12-17  7:06       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-17  1:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer

On Sun, Dec 16, 2018 at 05:11:59PM +0100, Christoph Hellwig wrote:
> On Sun, Dec 16, 2018 at 10:25:14AM +0800, Ming Lei wrote:
> > If we don't have poll queues, their mapping should share the default
> > type's mapping instead of setting up a new one via blk_mq_map_queues().
> 
> No, if we don't have poll queues nr_maps should be 2.  I've already
> posted a patch for that for nvme-pci which got it wrong (my fault..)

This patch doesn't touch nr_maps.

The issue is that if poll_queues/write_queues are zero, all three mappings
should be the same; however they aren't, because we still use blk_mq_map_queues()
to build a new mapping for the poll queues.

This difference breaks the mapping once blk_mq_map_swqueue() runs, since
hctx->type can be re-written.
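
To show the difference concretely, compare an IRQ-affinity style spread
with a plain software spread on a small made-up box (toy numbers only,
not the kernel's actual spreading code):

#include <stdio.h>

#define NR_CPUS		8
#define NR_QUEUES	2

int main(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		/* e.g. grouped by NUMA node vs. a contiguous split */
		unsigned int irq_based = cpu % NR_QUEUES;
		unsigned int sw_based  = cpu / (NR_CPUS / NR_QUEUES);

		printf("cpu%d: default map -> hctx%u, poll map -> hctx%u%s\n",
		       cpu, irq_based, sw_based,
		       irq_based != sw_based ? "   <-- disagree" : "");
	}
	return 0;
}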

Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-16 19:12       ` Jens Axboe
@ 2018-12-17  1:04         ` Ming Lei
  2018-12-17  7:44           ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2018-12-17  1:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Mike Snitzer, Christoph Hellwig, linux-block, Jeff Moyer

On Sun, Dec 16, 2018 at 12:12:17PM -0700, Jens Axboe wrote:
> On 12/16/18 11:39 AM, Mike Snitzer wrote:
> > On Sun, Dec 16 2018 at 11:16am -0500,
> > Christoph Hellwig <hch@lst.de> wrote:
> > 
> >> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> >>> This patch sets map->nr_queues to zero explicitly if there are zero
> >>> queues for a given queue type, so that blk_mq_map_swqueue() can deal
> >>> with shared mappings more robustly.
> >>
> >> This looks a lot more clumsy than what we had before, can you explain
> >> what additional robustness it buys us?
> > 
> > It enables nvme IO to complete on my testbed with for-4.21/block
> > changes; this NUMA layout is what triggered Ming's work:
> > 
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 2 4 6 8 10 12 14
> > node 0 size: 128605 MB
> > node 0 free: 128092 MB
> > node 1 cpus: 1 3 5 7 9 11 13 15
> > node 1 size: 128997 MB
> > node 1 free: 128529 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> > 
> > Without the aggregate changes from this patchset (1-3 anyway) I get IO
> > hangs in blkdev_fsync().
> 
> Still puzzled. I'm not against the change, but the commit message has
> NOTHING in the way of justification. "Make X more robust" doesn't
> mean anything. Your followup brings no extra info to the table in
> terms of what the bug is here. What exact bug is it fixing? Why is
> fsync currently hanging?

As I explained in the previous mail, the poll queues get a new mapping via
blk_mq_map_queues(); blk_mq_map_swqueue() may then over-write hctx->type
based on this new mapping, and the queue mapping ends up totally broken.
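
The following self-contained toy (reusing the made-up spreads from my
previous mail) mimics only the hctx->type assignment done in
blk_mq_map_swqueue(), not the kernel code itself, and prints every time
an hctx gets re-labelled:

#include <stdio.h>

enum { TYPE_DEFAULT, TYPE_POLL, NR_TYPES };

#define NR_CPUS		8
#define NR_HCTX		2

int main(void)
{
	/* cpu -> hctx index per map type; both types share hctx 0 and 1 */
	unsigned int map[NR_TYPES][NR_CPUS] = {
		[TYPE_DEFAULT] = { 0, 1, 0, 1, 0, 1, 0, 1 },
		[TYPE_POLL]    = { 0, 0, 0, 0, 1, 1, 1, 1 },
	};
	int hctx_type[NR_HCTX] = { -1, -1 };
	unsigned long hctx_mask[NR_HCTX] = { 0, 0 };
	int cpu, type;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		for (type = 0; type < NR_TYPES; type++) {
			unsigned int h = map[type][cpu];

			/* the same "already mapped" check as in the kernel */
			if (hctx_mask[h] & (1UL << cpu))
				continue;
			hctx_mask[h] |= 1UL << cpu;

			if (hctx_type[h] >= 0 && hctx_type[h] != type)
				printf("cpu%d: hctx%u re-labelled %d -> %d\n",
				       cpu, h, hctx_type[h], type);
			hctx_type[h] = type;
		}
	}
	return 0;
}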


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/4] blk-mq-debugfs: export hctx->type
  2018-12-16 16:17   ` Christoph Hellwig
  2018-12-16 19:12     ` Jens Axboe
@ 2018-12-17  1:06     ` Ming Lei
  1 sibling, 0 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-17  1:06 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer

On Sun, Dec 16, 2018 at 05:17:44PM +0100, Christoph Hellwig wrote:
> On Sun, Dec 16, 2018 at 10:25:17AM +0800, Ming Lei wrote:
> > Now we only export hctx->type via sysfs, and there is no such info in
> > the hctx entry under debugfs. We often use debugfs alone to diagnose
> > queue mapping issues, so add the support in debugfs.
> >
> > Queue mapping has become a bit more complicated now that multiple queue
> > maps are supported; we may write a blktests case that verifies the queue
> > mapping based on blk-mq debugfs.
> 
> I'll let Jens decide if we really want to double export information
> like this (or maybe even move it to debugfs only?)

It is friendlier to export everything in one place, given that users may
collect logs with a single command or in a familiar way.

> 
> > +static int hctx_type_show(void *data, struct seq_file *m)
> > +{
> > +	struct blk_mq_hw_ctx *hctx = data;
> > +
> > +	seq_printf(m, "%u\n", hctx->type);
> > +	return 0;
> 
> But if we export it we should probably export it as a text, just
> like we do for sysfs now.

OK.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-16 16:16   ` Christoph Hellwig
  2018-12-16 18:39     ` Mike Snitzer
@ 2018-12-17  1:27     ` Ming Lei
  1 sibling, 0 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-17  1:27 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer

On Sun, Dec 16, 2018 at 05:16:50PM +0100, Christoph Hellwig wrote:
> On Sun, Dec 16, 2018 at 10:25:16AM +0800, Ming Lei wrote:
> > This patch sets map->nr_queues to zero explicitly if there are zero
> > queues for a given queue type, so that blk_mq_map_swqueue() can deal
> > with shared mappings more robustly.
> 
> This looks a lot more clumsy than what we had before, can you explain
> what additional robustness it buys us?

If a queue type has no queues, its .nr_queues is set to zero explicitly,
so we don't need to touch the related hctx/ctx retrieved via the shared
mapping table.

thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] nvme-pci: correct mapping for poll queue(s)
  2018-12-17  1:00     ` Ming Lei
@ 2018-12-17  7:06       ` Christoph Hellwig
  2018-12-17  7:22         ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2018-12-17  7:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Christoph Hellwig, Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer

On Mon, Dec 17, 2018 at 09:00:52AM +0800, Ming Lei wrote:
> On Sun, Dec 16, 2018 at 05:11:59PM +0100, Christoph Hellwig wrote:
> > On Sun, Dec 16, 2018 at 10:25:14AM +0800, Ming Lei wrote:
> > > If we don't have poll queues, their mapping should share the default
> > > type's mapping instead of setting up a new one via blk_mq_map_queues().
> > 
> > No, if we don't have poll queues nr_maps should be 2.  I've already
> > posted a patch for that for nvme-pci which got it wrong (my fault..)
> 
> This patch doesn't touch nr_maps.
> 
> The issue is that if poll_queues/write_queues are zero, all three mappings
> should be the same; however they aren't, because we still use blk_mq_map_queues()
> to build a new mapping for the poll queues.

In Jens' for-4.21 we only support polling with explicit poll queues,
and thus an explicit poll map.  Which means nr_maps must be 3 if you
support poll maps, and < 3 if you don't.  As said, the only user actually
got that wrong (my fault), but the fix is already queued up in the nvme
tree and on its way to Jens.
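
For reference, the nvme-pci side of that fix boils down to registering a
poll map only when poll queues were actually allocated, roughly (a
sketch, not necessarily the exact queued patch):

	dev->tagset.nr_maps = 2;	/* default + read */
	if (dev->io_queues[HCTX_TYPE_POLL])
		dev->tagset.nr_maps++;	/* only then add a poll map */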

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/4] nvme-pci: correct mapping for poll queue(s)
  2018-12-17  7:06       ` Christoph Hellwig
@ 2018-12-17  7:22         ` Ming Lei
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-17  7:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Jeff Moyer, Mike Snitzer

On Mon, Dec 17, 2018 at 08:06:04AM +0100, Christoph Hellwig wrote:
> On Mon, Dec 17, 2018 at 09:00:52AM +0800, Ming Lei wrote:
> > On Sun, Dec 16, 2018 at 05:11:59PM +0100, Christoph Hellwig wrote:
> > > On Sun, Dec 16, 2018 at 10:25:14AM +0800, Ming Lei wrote:
> > > > If we don't have poll queues, their mapping should share the default
> > > > type's mapping instead of setting up a new one via blk_mq_map_queues().
> > > 
> > > No, if we don't have poll queues nr_maps should be 2.  I've already
> > > posted a patch for that for nvme-pci which got it wrong (my fault..)
> > 
> > This patch doesn't touch nr_maps.
> > 
> > The issue is that if poll_queues/write_queues are zero, all three mappings
> > should be the same; however they aren't, because we still use blk_mq_map_queues()
> > to build a new mapping for the poll queues.
> 
> In Jens' for-4.21 we only support polling with explicit poll queues,
> and thus an explicit poll map.  Which means nr_maps must be 3 if you
> support poll maps, and < 3 if you don't.  As said, the only user actually
> got that wrong (my fault), but the fix is already queued up in the nvme
> tree and on its way to Jens.

OK, I just don't see it in Jens's tree yet.

However, IMO the fix in this patchset may be better, given that we can
still pass HCTX_MAX_TYPES no matter whether poll_queues is zero or not,
and in the future 'poll_queues/write_queues' might become per-controller
attributes.

For example, with two NVMe controllers, one slow and one fast, we may
want to enable polling only for the fast one.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-17  1:04         ` Ming Lei
@ 2018-12-17  7:44           ` Christoph Hellwig
  2018-12-17  8:38             ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2018-12-17  7:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Mike Snitzer, Christoph Hellwig, linux-block, Jeff Moyer

I suspect we want something like this to make sure we never look
at queue maps that don't have nr_queues set, and then just don't
have to initialize them in nvme:

diff --git a/block/blk-mq.h b/block/blk-mq.h
index b63a0de8a07a..d1ed096723fb 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 {
 	enum hctx_type type = HCTX_TYPE_DEFAULT;
 
-	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
-	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
+	if ((flags & REQ_HIPRI) &&
+	    q->tag_set->nr_maps > HCTX_TYPE_POLL && 
+	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
+	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		type = HCTX_TYPE_POLL;
 
-	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
-		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
+	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
+	         q->tag_set->nr_maps > HCTX_TYPE_READ &&
+		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
 		type = HCTX_TYPE_READ;
-
+	
 	return blk_mq_map_queue_type(q, type, cpu);
 }
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fb9d8270f32c..698b350b38cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 		map->nr_queues = dev->io_queues[i];
 		if (!map->nr_queues) {
 			BUG_ON(i == HCTX_TYPE_DEFAULT);
-
-			/* shared set, resuse read set parameters */
-			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
-			qoff = 0;
-			offset = queue_irq_offset(dev);
+			continue;
 		}
 
 		/*

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/4] blk-mq: deal with shared queue mapping reliably
  2018-12-17  7:44           ` Christoph Hellwig
@ 2018-12-17  8:38             ` Ming Lei
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2018-12-17  8:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, Mike Snitzer, linux-block, Jeff Moyer

On Mon, Dec 17, 2018 at 08:44:42AM +0100, Christoph Hellwig wrote:
> I suspect we want something like this to make sure we never look
> at queue maps that don't have nr_queues set, and then just don't
> have to initialize them in nvme:
> 
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index b63a0de8a07a..d1ed096723fb 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -105,14 +105,17 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
>  {
>  	enum hctx_type type = HCTX_TYPE_DEFAULT;
>  
> -	if (q->tag_set->nr_maps > HCTX_TYPE_POLL &&
> -	    ((flags & REQ_HIPRI) && test_bit(QUEUE_FLAG_POLL, &q->queue_flags)))
> +	if ((flags & REQ_HIPRI) &&
> +	    q->tag_set->nr_maps > HCTX_TYPE_POLL && 
> +	    q->tag_set->map[HCTX_TYPE_POLL].nr_queues &&
> +	    test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
>  		type = HCTX_TYPE_POLL;
>  
> -	else if (q->tag_set->nr_maps > HCTX_TYPE_READ &&
> -		 ((flags & REQ_OP_MASK) == REQ_OP_READ))
> +	else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
> +	         q->tag_set->nr_maps > HCTX_TYPE_READ &&
> +		 q->tag_set->map[HCTX_TYPE_READ].nr_queues)
>  		type = HCTX_TYPE_READ;
> -
> +	
>  	return blk_mq_map_queue_type(q, type, cpu);
>  }


>  
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fb9d8270f32c..698b350b38cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -496,11 +496,7 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
>  		map->nr_queues = dev->io_queues[i];
>  		if (!map->nr_queues) {
>  			BUG_ON(i == HCTX_TYPE_DEFAULT);
> -
> -			/* shared set, resuse read set parameters */
> -			map->nr_queues = dev->io_queues[HCTX_TYPE_DEFAULT];
> -			qoff = 0;
> -			offset = queue_irq_offset(dev);
> +			continue;
>  		}
>  
>  		/*

This way works with just a small cost in blk_mq_map_queue(); given that
we have the hctx cached in rq->mq_hctx, I think this approach is good.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread

Thread overview: 20+ messages
2018-12-16  2:25 [PATCH 0/4] blk-mq: queue mapping fix & improvement Ming Lei
2018-12-16  2:25 ` [PATCH 1/4] nvme-pci: correct mapping for poll queue(s) Ming Lei
2018-12-16 16:11   ` Christoph Hellwig
2018-12-17  1:00     ` Ming Lei
2018-12-17  7:06       ` Christoph Hellwig
2018-12-17  7:22         ` Ming Lei
2018-12-16  2:25 ` [PATCH 2/4] blk-mq: fix allocation for queue mapping table Ming Lei
2018-12-16 16:15   ` Christoph Hellwig
2018-12-16  2:25 ` [PATCH 3/4] blk-mq: deal with shared queue mapping reliably Ming Lei
2018-12-16 16:16   ` Christoph Hellwig
2018-12-16 18:39     ` Mike Snitzer
2018-12-16 19:12       ` Jens Axboe
2018-12-17  1:04         ` Ming Lei
2018-12-17  7:44           ` Christoph Hellwig
2018-12-17  8:38             ` Ming Lei
2018-12-17  1:27     ` Ming Lei
2018-12-16  2:25 ` [PATCH 4/4] blk-mq-debugfs: export hctx->type Ming Lei
2018-12-16 16:17   ` Christoph Hellwig
2018-12-16 19:12     ` Jens Axboe
2018-12-17  1:06     ` Ming Lei
