linux-block.vger.kernel.org archive mirror
* [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
@ 2021-08-18 14:44 Ming Lei
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

Hi,

blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
io queues, and the sw ctx is chosen as the first online CPU in
hctx->cpumask. However, all CPUs in hctx->cpumask may be offline.

This usage model isn't well supported by blk-mq, which assumes that
allocation is always done on an online CPU in hctx->cpumask. That
assumption is tied to managed irq, which also requires blk-mq to drain
inflight requests in a hctx when the last CPU in hctx->cpumask is going
offline.

However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
them to ask for request allocation even when the specified hctx is
inactive (all CPUs in hctx->cpumask are offline). Fix
blk_mq_alloc_request_hctx() by allowing allocation when all CPUs of the
hctx are offline.
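
For reference, the CPU selection that trips over this today looks roughly
like the following (simplified sketch, not the literal upstream code):

	data.hctx = q->queue_hw_ctx[hctx_idx];
	/* returns nr_cpu_ids when every CPU in hctx->cpumask is offline */
	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
	/* an out-of-range 'cpu' then yields a bogus sw ctx and a crash */
	data.ctx = __blk_mq_get_ctx(q, cpu);

Patch 3/3 switches this to blk_mq_first_mapped_cpu(), which falls back to
cpumask_first(hctx->cpumask) when no CPU in the mask is online.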

Wen Xiong has verified V4 in her nvmef test.

V7:
	- move blk_mq_hctx_use_managed_irq() into block/blk-mq.c, 3/3

V6:
	- move device_has_managed_msi_irq() into kernel/irq/msi.c

V5:
	- take John Garry's suggestion to replace device field with
	new helper of device_has_managed_msi_irq()
V4:
	- remove patches for cleanup queue map helpers
	- take Christoph's suggestion to add field into 'struct device' for
	describing if managed irq is allocated from one device

V3:
	- cleanup map queues helpers, and remove pci/virtio/rdma queue
	  helpers
	- store use managed irq info into qmap

V2:
	- use flag of BLK_MQ_F_MANAGED_IRQ
	- pass BLK_MQ_F_MANAGED_IRQ from driver explicitly
	- kill BLK_MQ_F_STACKING


Ming Lei (3):
  genirq: add device_has_managed_msi_irq
  blk-mq: mark if one queue map uses managed irq
  blk-mq: don't deactivate hctx if managed irq isn't used

 block/blk-mq-pci.c                     |  2 ++
 block/blk-mq-rdma.c                    |  7 ++++++
 block/blk-mq-virtio.c                  |  2 ++
 block/blk-mq.c                         | 35 ++++++++++++++++++--------
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c |  1 +
 include/linux/blk-mq.h                 |  3 ++-
 include/linux/msi.h                    |  5 ++++
 kernel/irq/msi.c                       | 18 +++++++++++++
 8 files changed, 62 insertions(+), 11 deletions(-)

-- 
2.31.1



* [PATCH V7 1/3] genirq: add device_has_managed_msi_irq
  2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
@ 2021-08-18 14:44 ` Ming Lei
  2021-10-11 18:23   ` Varad Gautam
  2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

A driver may allocate irq vectors with managed affinity, and blk-mq
needs to know about that for draining queues, because the genirq core
shuts down a managed irq when all CPUs in its affinity mask are offline.

The information about whether managed irq is used is produced by the
drivers and consumed by blk-mq, so different subsystems are involved in
this info flow.

Address this by adding a helper, device_has_managed_msi_irq(), as
suggested by John Garry.
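
For reference, the intended consumer side (wired up in patch 2/3 of this
series) simply records the answer in the queue map while mapping queues:

	/* e.g. in blk_mq_pci_map_queues(), after the cpu -> queue mapping */
	qmap->use_managed_irq = device_has_managed_msi_irq(&pdev->dev);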

Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/msi.h |  5 +++++
 kernel/irq/msi.c    | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index a20dc66b9946..b4511c606072 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -59,10 +59,15 @@ struct platform_msi_priv_data;
 void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 #ifdef CONFIG_GENERIC_MSI_IRQ
 void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
+bool device_has_managed_msi_irq(struct device *dev);
 #else
 static inline void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
 }
+static inline bool device_has_managed_msi_irq(struct device *dev)
+{
+	return false;
+}
 #endif
 
 typedef void (*irq_write_msi_msg_t)(struct msi_desc *desc,
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 00f89d5bd6f6..167bc1d8bf4a 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -71,6 +71,24 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
 }
 EXPORT_SYMBOL_GPL(get_cached_msi_msg);
 
+/**
+ * device_has_managed_msi_irq - Query if device has managed irq entry
+ * @dev:	Pointer to the device for which we want to query
+ *
+ * Return true if there is managed irq vector allocated on this device
+ */
+bool device_has_managed_msi_irq(struct device *dev)
+{
+	struct msi_desc *desc;
+
+	for_each_msi_entry(desc, dev) {
+		if (desc->affinity && desc->affinity->is_managed)
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(device_has_managed_msi_irq);
+
 #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
 static inline void irq_chip_write_msi_msg(struct irq_data *data,
 					  struct msi_msg *msg)
-- 
2.31.1



* [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq
  2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
@ 2021-08-18 14:44 ` Ming Lei
  2021-08-23 17:17   ` Sagi Grimberg
  2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
  2021-08-19 22:38 ` [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  3 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

Retrieve this info via the newly added helper device_has_managed_msi_irq(),
so that we can decide whether a hctx needs to be drained before all of its
CPUs go offline.

Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-pci.c                     | 2 ++
 block/blk-mq-rdma.c                    | 7 +++++++
 block/blk-mq-virtio.c                  | 2 ++
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c | 1 +
 include/linux/blk-mq.h                 | 3 ++-
 5 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index b595a94c4d16..e452cda0896a 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -8,6 +8,7 @@
 #include <linux/blk-mq-pci.h>
 #include <linux/pci.h>
 #include <linux/module.h>
+#include <linux/msi.h>
 
 #include "blk-mq.h"
 
@@ -37,6 +38,7 @@ int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = device_has_managed_msi_irq(&pdev->dev);
 
 	return 0;
 
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 14f968e58b8f..19ad31c44eab 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -36,6 +36,13 @@ int blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
 			map->mq_map[cpu] = map->queue_offset + queue;
 	}
 
+	/*
+	 * RDMA doesn't use managed irq, and nvme rdma driver can allocate
+	 * and submit requests on specified hctx via
+	 * blk_mq_alloc_request_hctx
+	 */
+	map->use_managed_irq = false;
+
 	return 0;
 
 fallback:
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index 7b8a42c35102..2ce39fb77dce 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -7,6 +7,7 @@
 #include <linux/blk-mq-virtio.h>
 #include <linux/virtio_config.h>
 #include <linux/module.h>
+#include <linux/msi.h>
 #include "blk-mq.h"
 
 /**
@@ -38,6 +39,7 @@ int blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = device_has_managed_msi_irq(&vdev->dev);
 
 	return 0;
 fallback:
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index b0b2361e63fe..7d7df261d346 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3562,6 +3562,7 @@ static int map_queues_v2_hw(struct Scsi_Host *shost)
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = true;
 
 	return 0;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 22215db36122..fd5540fba4ef 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -192,7 +192,8 @@ struct blk_mq_hw_ctx {
 struct blk_mq_queue_map {
 	unsigned int *mq_map;
 	unsigned int nr_queues;
-	unsigned int queue_offset;
+	unsigned int queue_offset:31;
+	unsigned int use_managed_irq:1;
 };
 
 /**
-- 
2.31.1



* [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
  2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
@ 2021-08-18 14:44 ` Ming Lei
  2021-08-23 17:18   ` Sagi Grimberg
  2021-09-15 16:14   ` Daniel Wagner
  2021-08-19 22:38 ` [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  3 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

blk-mq deactivates a hctx when the last CPU in hctx->cpumask goes
offline, by draining all requests originating from this hctx and moving
new allocations to other active hctxs. This is done to avoid inflight IO
in the managed irq case, because a managed irq is shut down when the
last CPU in its affinity mask goes offline.

However, lots of drivers (nvme fc, rdma, tcp, loop, ...) don't use
managed irq, so they don't need to deactivate a hctx when its last CPU
goes offline. Some of them are also the only users of
blk_mq_alloc_request_hctx(), which is used for connecting io queues, and
their requirement is that the connect request must be submitted
successfully via one specified hctx even though all CPUs in that
hctx->cpumask have gone offline.

Address the requirement for nvme fc/rdma/loop by allowing request
allocation from a hctx when all of its CPUs are offline, since these
drivers don't use managed irq.

Finally, don't deactivate a hctx when it doesn't use managed irq.

Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ab4540320ca..9abfad3d5b48 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -427,6 +427,23 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 }
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
+static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+
+	if (cpu >= nr_cpu_ids)
+		cpu = cpumask_first(hctx->cpumask);
+	return cpu;
+}
+
+static bool blk_mq_hctx_use_managed_irq(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->type == HCTX_TYPE_POLL)
+		return false;
+
+	return hctx->queue->tag_set->map[hctx->type].use_managed_irq;
+}
+
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
@@ -468,7 +485,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	data.hctx = q->queue_hw_ctx[hctx_idx];
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
-	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+
+	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
+
+	cpu = blk_mq_first_mapped_cpu(data.hctx);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	if (!q->elevator)
@@ -1501,15 +1521,6 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	hctx_unlock(hctx, srcu_idx);
 }
 
-static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
-{
-	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
-
-	if (cpu >= nr_cpu_ids)
-		cpu = cpumask_first(hctx->cpumask);
-	return cpu;
-}
-
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -2556,6 +2567,10 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
 
+	/* hctx doesn't need to be deactivated when managed irq isn't used */
+	if (!blk_mq_hctx_use_managed_irq(hctx))
+		return 0;
+
 	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
 	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
 		return 0;
-- 
2.31.1



* Re: [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
  2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
                   ` (2 preceding siblings ...)
  2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
@ 2021-08-19 22:38 ` Ming Lei
  3 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2021-08-19 22:38 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong

On Wed, Aug 18, 2021 at 10:44:25PM +0800, Ming Lei wrote:
> Hi,
> 
> blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
> io queues, and the sw ctx is chosen as the first online CPU in
> hctx->cpumask. However, all CPUs in hctx->cpumask may be offline.
> 
> This usage model isn't well supported by blk-mq, which assumes that
> allocation is always done on an online CPU in hctx->cpumask. That
> assumption is tied to managed irq, which also requires blk-mq to drain
> inflight requests in a hctx when the last CPU in hctx->cpumask is going
> offline.
> 
> However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
> them to ask for request allocation even when the specified hctx is
> inactive (all CPUs in hctx->cpumask are offline). Fix
> blk_mq_alloc_request_hctx() by allowing allocation when all CPUs of the
> hctx are offline.
> 
> Wen Xiong has verified V4 in her nvmef test.
> 
> V7:
> 	- move blk_mq_hctx_use_managed_irq() into block/blk-mq.c, 3/3

Hello Jens,

NVMe TCP and the other fabrics transports have become quite popular
recently, and the kernel panic in blk_mq_alloc_request_hctx() has been
annoying people for quite a while.

Any chance to pull these three patches in so we can fix this in 5.15?


Thanks,
Ming



* Re: [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq
  2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
@ 2021-08-23 17:17   ` Sagi Grimberg
  0 siblings, 0 replies; 13+ messages in thread
From: Sagi Grimberg @ 2021-08-23 17:17 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Daniel Wagner, Wen Xiong

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
@ 2021-08-23 17:18   ` Sagi Grimberg
  2021-09-15 16:14   ` Daniel Wagner
  1 sibling, 0 replies; 13+ messages in thread
From: Sagi Grimberg @ 2021-08-23 17:18 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Daniel Wagner, Wen Xiong

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
  2021-08-23 17:18   ` Sagi Grimberg
@ 2021-09-15 16:14   ` Daniel Wagner
  2021-09-16  2:17     ` Ming Lei
  1 sibling, 1 reply; 13+ messages in thread
From: Daniel Wagner @ 2021-09-15 16:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong

On Wed, Aug 18, 2021 at 10:44:28PM +0800, Ming Lei wrote:
>  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
>  {
> @@ -468,7 +485,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  	data.hctx = q->queue_hw_ctx[hctx_idx];
>  	if (!blk_mq_hw_queue_mapped(data.hctx))
>  		goto out_queue_exit;
> -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> +
> +	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
> +
> +	cpu = blk_mq_first_mapped_cpu(data.hctx);
>  	data.ctx = __blk_mq_get_ctx(q, cpu);

I was pondering how we could address the issue that the qla2xxx driver
uses managed IRQs, which makes nvme-fc as a class depend on managed
IRQs.

blk_mq_alloc_request_hctx() is the only place where we really need to
distinguish between managed and !managed IRQs. As far as I understand
the situation, if all CPUs for a hctx go offline, the driver won't use
this context. So the only case where we end up in this code path is when
the driver tries to reconnect the queues, e.g. after devloss. Couldn't we
in this case just return an error and go into error recovery? Something
like this:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a2db50886a26..52fc8592c72e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -486,9 +486,13 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
        if (!blk_mq_hw_queue_mapped(data.hctx))
                goto out_queue_exit;
 
-       WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
-
-       cpu = blk_mq_first_mapped_cpu(data.hctx);
+       if (blk_mq_hctx_use_managed_irq(data.hctx)) {
+               cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+               if (cpu >= nr_cpu_ids)
+                       return ERR_PTR(-EINVAL);
+       } else {
+               cpu = blk_mq_first_mapped_cpu(data.hctx);
+       }
        data.ctx = __blk_mq_get_ctx(q, cpu);



* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-09-15 16:14   ` Daniel Wagner
@ 2021-09-16  2:17     ` Ming Lei
  2021-09-16  7:42       ` Daniel Wagner
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-09-16  2:17 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart

On Wed, Sep 15, 2021 at 06:14:59PM +0200, Daniel Wagner wrote:
> On Wed, Aug 18, 2021 at 10:44:28PM +0800, Ming Lei wrote:
> >  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >  	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
> >  {
> > @@ -468,7 +485,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >  	data.hctx = q->queue_hw_ctx[hctx_idx];
> >  	if (!blk_mq_hw_queue_mapped(data.hctx))
> >  		goto out_queue_exit;
> > -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> > +
> > +	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
> > +
> > +	cpu = blk_mq_first_mapped_cpu(data.hctx);
> >  	data.ctx = __blk_mq_get_ctx(q, cpu);
> 
> I was pondering how we could address the issue that the qla2xxx driver
> uses managed IRQs, which makes nvme-fc as a class depend on managed
> IRQs.
> 
> blk_mq_alloc_request_hctx() is the only place where we really need to
> distinguish between managed and !managed IRQs. As far as I understand
> the situation, if all CPUs for a hctx go offline, the driver won't use
> this context. So the only case where we end up in this code path is when
> the driver tries to reconnect the queues, e.g. after devloss. Couldn't we
> in this case just return an error and go into error recovery? Something
> like this:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a2db50886a26..52fc8592c72e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -486,9 +486,13 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>         if (!blk_mq_hw_queue_mapped(data.hctx))
>                 goto out_queue_exit;
>  
> -       WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
> -
> -       cpu = blk_mq_first_mapped_cpu(data.hctx);
> +       if (blk_mq_hctx_use_managed_irq(data.hctx)) {

Firstly, even with the 'qla2xxx - add nvme map_queues support' patches,
the knowledge of whether managed irq is used in the nvmef LLD is still
missing, so blk_mq_hctx_use_managed_irq() may always return false, but
that shouldn't be hard to solve.

The problem is that we still have to get the connect io queue command
completed even when all CPUs of this hctx are offline in the managed irq
case.

One solution might be to use io polling for connecting the io queue, but
nvme fc doesn't support polling; all the other nvme hosts do support it.



Thanks,
Ming



* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-09-16  2:17     ` Ming Lei
@ 2021-09-16  7:42       ` Daniel Wagner
  2021-09-16  8:13         ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Wagner @ 2021-09-16  7:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart

On Thu, Sep 16, 2021 at 10:17:18AM +0800, Ming Lei wrote:
> Firstly, even with the 'qla2xxx - add nvme map_queues support' patches,
> the knowledge of whether managed irq is used in the nvmef LLD is still
> missing, so blk_mq_hctx_use_managed_irq() may always return false, but
> that shouldn't be hard to solve.

Yes, that's pretty simple:

--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -7914,6 +7914,9 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
                rc = blk_mq_map_queues(qmap);
        else
                rc = blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
+
+       qmap->use_managed_irq = true;
+
        return rc;
 }

> The problem is that we still have to get the connect io queue command
> completed even when all CPUs of this hctx are offline in the managed irq
> case.

I agree, though if I understand this right, the scenario where all CPUs
of a hctx are offline and we still want to use this hctx only happens
after an initial setup, when a reconnect attempt is made. That is, during
the first connect attempt only online CPUs are assigned to the hctx.
When the CPUs are taken offline, the block layer makes sure not to use
those queues anymore (no problem for the hctx so far). Then for some
reason the nvme-fc layer decides to reconnect and we end up in the
situation where we don't have any online CPU in the given hctx.

> One solution might be to use io polling for connecting the io queue, but
> nvme fc doesn't support polling; all the other nvme hosts do support it.

No idea, something to explore for sure :)

My point is that your series fixes existing bugs and doesn't introduce a
new one. qla2xxx already depends on managed IRQs. I would like to see
your series accepted, with my hack as a stop-gap solution until we have
a proper fix.


* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-09-16  7:42       ` Daniel Wagner
@ 2021-09-16  8:13         ` Ming Lei
  2021-10-04 12:25           ` [RFC] nvme-fc: Allow managed IRQs Daniel Wagner
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-09-16  8:13 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart

On Thu, Sep 16, 2021 at 09:42:29AM +0200, Daniel Wagner wrote:
> On Thu, Sep 16, 2021 at 10:17:18AM +0800, Ming Lei wrote:
> > Firstly, even with the 'qla2xxx - add nvme map_queues support' patches,
> > the knowledge of whether managed irq is used in the nvmef LLD is still
> > missing, so blk_mq_hctx_use_managed_irq() may always return false, but
> > that shouldn't be hard to solve.
> 
> Yes, that's pretty simple:
> 
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -7914,6 +7914,9 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
>                 rc = blk_mq_map_queues(qmap);
>         else
>                 rc = blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
> +
> +       qmap->use_managed_irq = true;
> +
>         return rc;
>  }

blk_mq_alloc_request_hctx() won't be called on a qla2xxx queue; what we
need is to mark the nvmef queue map with .use_managed_irq if the LLD
uses managed irq.
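
Roughly what I mean is something like this (untested sketch; the hook
name and the plumbing are made up just to illustrate the idea):

	/* in the LLD's nvme-fc map_queues hook, mark the *nvme* queue map */
	static int qla_nvme_map_queues(struct nvme_fc_local_port *lport,
				       struct blk_mq_tag_set *set)
	{
		struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
		int ret = blk_mq_map_queues(qmap);

		qmap->use_managed_irq = true;	/* qla2xxx uses managed irq */
		return ret;
	}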

> 
> > The problem is that we still have to get the connect io queue command
> > completed even when all CPUs of this hctx are offline in the managed irq
> > case.
> 
> I agree, though if I understand this right, the scenario where all CPUs
> of a hctx are offline and we still want to use this hctx only happens
> after an initial setup, when a reconnect attempt is made. That is, during
> the first connect attempt only online CPUs are assigned to the hctx.
> When the CPUs are taken offline, the block layer makes sure not to use
> those queues anymore (no problem for the hctx so far). Then for some
> reason the nvme-fc layer decides to reconnect and we end up in the
> situation where we don't have any online CPU in the given hctx.

It is simply that blk_mq_alloc_request_hctx() allocates a request from
one specified hctx, and the specified hctx can go offline at any time.

> 
> > One solution might be to use io polling for connecting the io queue, but
> > nvme fc doesn't support polling; all the other nvme hosts do support it.
> 
> No idea, something to explore for sure :)

It is a totally raw idea, something like: start each queue in poll mode
and run the connect io queue command via polling. Once the connect io
queue command is done, switch the queue into normal mode. Then
blk_mq_alloc_request_hctx() is guaranteed to be successful.
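
In pseudo-code the idea would be something like the following; every
helper below is invented purely for illustration, nothing like this
exists today:

	/* raw sketch of the "connect via polling" idea */
	nvme_queue_set_poll_only(queue);	/* completions found by polling, no irq needed */
	ret = nvme_connect_io_queue_polled(ctrl, qid);	/* submit connect cmd, spin on the CQ */
	if (!ret)
		nvme_queue_set_irq_mode(queue);	/* back to interrupt mode once the queue is live */

That way the connect command could complete even when the hctx's managed
irq has been shut down because all of its CPUs are offline.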

> 
> My point is that your series fixes existing bugs and doesn't introduce a
> new one. qla2xxx already depends on managed IRQs. I would like to see
> your series accepted, with my hack as a stop-gap solution until we have
> a proper fix.

I am fine with working this way first if no one objects.


Thanks, 
Ming



* [RFC] nvme-fc: Allow managed IRQs
  2021-09-16  8:13         ` Ming Lei
@ 2021-10-04 12:25           ` Daniel Wagner
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Wagner @ 2021-10-04 12:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart,
	Hannes Reinecke, Daniel Wagner

nvme-fc is currently the only user of blk_mq_alloc_request_hctx().
With the recent changes to teach the nvme subsystem to honor managed
IRQs, the assumption was that nvme-fc as a whole doesn't use managed
IRQs. Unfortunately, the qla2xxx driver does use managed IRQs.

Add an interface for the nvme-fc drivers to update the mapping and also
to set the use_managed_irq flag. This is very ugly as we have to pass
down struct blk_mq_tag_set. I haven't found any better way so far.

Relax the requirement in blk_mq_alloc_request_hctx() that only
!managed IRQs are supported. As long as one CPU is online in the
requested hctx, all is good. If this is not the case, we return an
error, which allows the upper layer to start the reconnect loop.

As the current qla2xxx already depends on managed IRQs, the main
difference with and without this patch is that we see

  nvme nvme8: Connect command failed, error wo/DNR bit: -16402
  nvme nvme8: NVME-FC{8}: reset: Reconnect attempt failed (-18)

instead of just timeouts such as

  qla2xxx [0000:81:00.0]-5032:1: ABT_IOCB: Invalid completion handle (1da) -- timed-out.

In both cases the system recovers as soon as at least one CPU is online
in every hctx. Also note, this only affects admin requests. As long as
no FC reset happens and no reconnect attempt is triggered, user space
is able to issue I/Os to the target.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---

Hi,

I've played a bit with this patch to figure out what the impact is for
the qla2xxx driver. Basically, the situation doesn't change a lot with
Ming's patches. If we happen to run into the situation that all CPUs of
one hctx are offline and a reconnect attempt is triggered, all traffic
to the target ceases. But as soon as we have at least one CPU online in
every hctx, the system recovers.
This patch just makes it a bit more verbose (maybe a warning could be
added to blk_mq_alloc_request_hctx()).

Thanks,
Daniel

 block/blk-mq.c                  | 10 +++++++---
 drivers/nvme/host/fc.c          | 13 +++++++++++++
 drivers/scsi/qla2xxx/qla_nvme.c | 14 ++++++++++++++
 drivers/scsi/qla2xxx/qla_os.c   |  3 +++
 include/linux/nvme-fc-driver.h  |  4 ++++
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a2db50886a26..df65690bf649 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -486,9 +486,13 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
 
-	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
-
-	cpu = blk_mq_first_mapped_cpu(data.hctx);
+	if (blk_mq_hctx_use_managed_irq(data.hctx)) {
+		cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+		if (cpu >= nr_cpu_ids)
+			return ERR_PTR(ret);
+	} else {
+		cpu = blk_mq_first_mapped_cpu(data.hctx);
+	}
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	if (!q->elevator)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index aa14ad963d91..001ba8f0776b 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2841,6 +2841,17 @@ nvme_fc_complete_rq(struct request *rq)
 	nvme_fc_ctrl_put(ctrl);
 }
 
+static int
+nvme_fc_map_queues(struct blk_mq_tag_set *set)
+{
+	struct nvme_fc_ctrl *ctrl = set->driver_data;
+
+	if (ctrl->lport->ops->map_queues)
+		return ctrl->lport->ops->map_queues(&ctrl->lport->localport, set);
+
+	return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
+}
+
 
 static const struct blk_mq_ops nvme_fc_mq_ops = {
 	.queue_rq	= nvme_fc_queue_rq,
@@ -2849,6 +2860,7 @@ static const struct blk_mq_ops nvme_fc_mq_ops = {
 	.exit_request	= nvme_fc_exit_request,
 	.init_hctx	= nvme_fc_init_hctx,
 	.timeout	= nvme_fc_timeout,
+	.map_queues	= nvme_fc_map_queues,
 };
 
 static int
@@ -3392,6 +3404,7 @@ static const struct blk_mq_ops nvme_fc_admin_mq_ops = {
 	.exit_request	= nvme_fc_exit_request,
 	.init_hctx	= nvme_fc_init_admin_hctx,
 	.timeout	= nvme_fc_timeout,
+	.map_queues	= nvme_fc_map_queues,
 };
 
 
diff --git a/drivers/scsi/qla2xxx/qla_nvme.c b/drivers/scsi/qla2xxx/qla_nvme.c
index 1c5da2dbd6f9..b7681f66d05b 100644
--- a/drivers/scsi/qla2xxx/qla_nvme.c
+++ b/drivers/scsi/qla2xxx/qla_nvme.c
@@ -667,6 +667,19 @@ static void qla_nvme_remoteport_delete(struct nvme_fc_remote_port *rport)
 	complete(&fcport->nvme_del_done);
 }
 
+static int qla_nvme_map_queues(struct nvme_fc_local_port *lport,
+				struct blk_mq_tag_set *set)
+{
+
+	struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
+	int ret;
+
+	ret = blk_mq_map_queues(qmap);
+	qmap->use_managed_irq = true;
+
+	return ret;
+}
+
 static struct nvme_fc_port_template qla_nvme_fc_transport = {
 	.localport_delete = qla_nvme_localport_delete,
 	.remoteport_delete = qla_nvme_remoteport_delete,
@@ -676,6 +689,7 @@ static struct nvme_fc_port_template qla_nvme_fc_transport = {
 	.ls_abort	= qla_nvme_ls_abort,
 	.fcp_io		= qla_nvme_post_cmd,
 	.fcp_abort	= qla_nvme_fcp_abort,
+	.map_queues	= qla_nvme_map_queues,
 	.max_hw_queues  = 8,
 	.max_sgl_segments = 1024,
 	.max_dif_sgl_segments = 64,
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index d2e40aaba734..188ce9b7f407 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -7914,6 +7914,9 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
 		rc = blk_mq_map_queues(qmap);
 	else
 		rc = blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
+
+	qmap->use_managed_irq = true;
+
 	return rc;
 }
 
diff --git a/include/linux/nvme-fc-driver.h b/include/linux/nvme-fc-driver.h
index 2a38f2b477a5..afae40d8c347 100644
--- a/include/linux/nvme-fc-driver.h
+++ b/include/linux/nvme-fc-driver.h
@@ -471,6 +471,8 @@ struct nvme_fc_remote_port {
  *       specified by the fcp_request->private pointer.
  *       Value is Mandatory. Allowed to be zero.
  */
+struct blk_mq_tag_set;
+
 struct nvme_fc_port_template {
 	/* initiator-based functions */
 	void	(*localport_delete)(struct nvme_fc_local_port *);
@@ -497,6 +499,8 @@ struct nvme_fc_port_template {
 	int	(*xmt_ls_rsp)(struct nvme_fc_local_port *localport,
 				struct nvme_fc_remote_port *rport,
 				struct nvmefc_ls_rsp *ls_rsp);
+	int	(*map_queues)(struct nvme_fc_local_port *localport,
+				struct blk_mq_tag_set *set);
 
 	u32	max_hw_queues;
 	u16	max_sgl_segments;
-- 
2.29.2



* Re: [PATCH V7 1/3] genirq: add device_has_managed_msi_irq
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
@ 2021-10-11 18:23   ` Varad Gautam
  0 siblings, 0 replies; 13+ messages in thread
From: Varad Gautam @ 2021-10-11 18:23 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong

Hi Ming,

On 8/18/21 4:44 PM, Ming Lei wrote:
> A driver may allocate irq vectors with managed affinity, and blk-mq
> needs to know about that for draining queues, because the genirq core
> shuts down a managed irq when all CPUs in its affinity mask are offline.
> 
> The information about whether managed irq is used is produced by the
> drivers and consumed by blk-mq, so different subsystems are involved in
> this info flow.
> 
> Address this by adding a helper, device_has_managed_msi_irq(), as
> suggested by John Garry.
> 
> Tested-by: Wen Xiong <wenxiong@us.ibm.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Suggested-by: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/msi.h |  5 +++++
>  kernel/irq/msi.c    | 18 ++++++++++++++++++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index a20dc66b9946..b4511c606072 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -59,10 +59,15 @@ struct platform_msi_priv_data;
>  void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
>  #ifdef CONFIG_GENERIC_MSI_IRQ
>  void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
> +bool device_has_managed_msi_irq(struct device *dev);

This led me to the following build warning:

In file included from ../drivers/iommu/irq_remapping.c:6:0:
../include/linux/msi.h:23:40: warning: 'struct device' declared inside parameter list will not be visible outside of this definition or declaration

A forward declaration for struct device before the #ifdef
would fix this.
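
Something along these lines (untested):

	/* in include/linux/msi.h, before the CONFIG_GENERIC_MSI_IRQ block */
	struct device;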

Regards,
Varad

>  #else
>  static inline void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
>  {
>  }
> +static inline bool device_has_managed_msi_irq(struct device *dev)
> +{
> +	return false;
> +}
>  #endif
>  
>  typedef void (*irq_write_msi_msg_t)(struct msi_desc *desc,
> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index 00f89d5bd6f6..167bc1d8bf4a 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -71,6 +71,24 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
>  }
>  EXPORT_SYMBOL_GPL(get_cached_msi_msg);
>  
> +/**
> + * device_has_managed_msi_irq - Query if device has managed irq entry
> + * @dev:	Pointer to the device for which we want to query
> + *
> + * Return true if there is managed irq vector allocated on this device
> + */
> +bool device_has_managed_msi_irq(struct device *dev)
> +{
> +	struct msi_desc *desc;
> +
> +	for_each_msi_entry(desc, dev) {
> +		if (desc->affinity && desc->affinity->is_managed)
> +			return true;
> +	}
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(device_has_managed_msi_irq);
> +
>  #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
>  static inline void irq_chip_write_msi_msg(struct irq_data *data,
>  					  struct msi_msg *msg)
> 




Thread overview: 13+ messages
2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
2021-10-11 18:23   ` Varad Gautam
2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
2021-08-23 17:17   ` Sagi Grimberg
2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
2021-08-23 17:18   ` Sagi Grimberg
2021-09-15 16:14   ` Daniel Wagner
2021-09-16  2:17     ` Ming Lei
2021-09-16  7:42       ` Daniel Wagner
2021-09-16  8:13         ` Ming Lei
2021-10-04 12:25           ` [RFC] nvme-fc: Allow managed IRQs Daniel Wagner
2021-08-19 22:38 ` [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
