* [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
@ 2021-08-18 14:44 Ming Lei
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

Hi,

blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
io queues, and the sw ctx is chosen as that of the first online CPU in
hctx->cpumask. However, all CPUs in hctx->cpumask may be offline.

This usage model isn't well supported by blk-mq, which assumes that
allocation is always done on an online CPU in hctx->cpumask. That
assumption is tied to managed irqs, which also require blk-mq to drain
inflight requests in a hctx when the last CPU in hctx->cpumask goes
offline.

However, NVMe fc/rdma/tcp/loop don't use managed irqs, so we should allow
them to ask for request allocation even when the specified hctx is
inactive (all CPUs in hctx->cpumask are offline). Fix
blk_mq_alloc_request_hctx() by allowing request allocation when all CPUs
of the hctx are offline.

Wen Xiong has verified V4 in her nvmef test.
V7:
	- move blk_mq_hctx_use_managed_irq() into block/blk-mq.c, 3/3

V6:
	- move device_has_managed_msi_irq() into kernel/irq/msi.c

V5:
	- take John Garry's suggestion to replace device field with new
	  helper of device_has_managed_msi_irq()

V4:
	- remove patches for cleanup queue map helpers
	- take Christoph's suggestion to add field into 'struct device'
	  for describing if managed irq is allocated from one device

V3:
	- cleanup map queues helpers, and remove pci/virtio/rdma queue
	  helpers
	- store use managed irq info into qmap

V2:
	- use flag of BLK_MQ_F_MANAGED_IRQ
	- pass BLK_MQ_F_MANAGED_IRQ from driver explicitly
	- kill BLK_MQ_F_STACKING

Ming Lei (3):
  genirq: add device_has_managed_msi_irq
  blk-mq: mark if one queue map uses managed irq
  blk-mq: don't deactivate hctx if managed irq isn't used

 block/blk-mq-pci.c                     |  2 ++
 block/blk-mq-rdma.c                    |  7 ++++++
 block/blk-mq-virtio.c                  |  2 ++
 block/blk-mq.c                         | 35 ++++++++++++++++++--------
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c |  1 +
 include/linux/blk-mq.h                 |  3 ++-
 include/linux/msi.h                    |  5 ++++
 kernel/irq/msi.c                       | 18 +++++++++++++
 8 files changed, 62 insertions(+), 11 deletions(-)

-- 
2.31.1

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH V7 1/3] genirq: add device_has_managed_msi_irq
  2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
@ 2021-08-18 14:44 ` Ming Lei
  2021-10-11 18:23   ` Varad Gautam
  2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

Drivers may allocate irq vectors with managed affinity, and blk-mq needs
this info for draining a queue, because the genirq core will shut down a
managed irq when all CPUs in its affinity mask are offline.

The info about using managed irqs is produced by drivers and consumed by
blk-mq, so different subsystems are involved in this info flow.

Address this by adding a helper, device_has_managed_msi_irq(), as
suggested by John Garry.
Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 include/linux/msi.h |  5 +++++
 kernel/irq/msi.c    | 18 ++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/include/linux/msi.h b/include/linux/msi.h
index a20dc66b9946..b4511c606072 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -59,10 +59,15 @@ struct platform_msi_priv_data;
 void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
 #ifdef CONFIG_GENERIC_MSI_IRQ
 void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
+bool device_has_managed_msi_irq(struct device *dev);
 #else
 static inline void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
 }
+static inline bool device_has_managed_msi_irq(struct device *dev)
+{
+	return false;
+}
 #endif
 
 typedef void (*irq_write_msi_msg_t)(struct msi_desc *desc,
diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
index 00f89d5bd6f6..167bc1d8bf4a 100644
--- a/kernel/irq/msi.c
+++ b/kernel/irq/msi.c
@@ -71,6 +71,24 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
 }
 EXPORT_SYMBOL_GPL(get_cached_msi_msg);
 
+/**
+ * device_has_managed_msi_irq - Query if device has managed irq entry
+ * @dev: Pointer to the device for which we want to query
+ *
+ * Return true if there is managed irq vector allocated on this device
+ */
+bool device_has_managed_msi_irq(struct device *dev)
+{
+	struct msi_desc *desc;
+
+	for_each_msi_entry(desc, dev) {
+		if (desc->affinity && desc->affinity->is_managed)
+			return true;
+	}
+	return false;
+}
+EXPORT_SYMBOL_GPL(device_has_managed_msi_irq);
+
 #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
 static inline void irq_chip_write_msi_msg(struct irq_data *data,
 					  struct msi_msg *msg)
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 1/3] genirq: add device_has_managed_msi_irq
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
@ 2021-10-11 18:23   ` Varad Gautam
  0 siblings, 0 replies; 13+ messages in thread
From: Varad Gautam @ 2021-10-11 18:23 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong

Hi Ming,

On 8/18/21 4:44 PM, Ming Lei wrote:
> irq vector allocation with managed affinity may be used by driver, and
> blk-mq needs this info for draining queue because genirq core will shutdown
> managed irq when all CPUs in the affinity mask are offline.
>
> The info of using managed irq is often produced by drivers, and it is
> consumed by blk-mq, so different subsystems are involved in this info flow.
>
> Address this issue by adding one helper of device_has_managed_msi_irq()
> which is suggested by John Garry.
>
> Tested-by: Wen Xiong <wenxiong@us.ibm.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Suggested-by: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  include/linux/msi.h |  5 +++++
>  kernel/irq/msi.c    | 18 ++++++++++++++++++
>  2 files changed, 23 insertions(+)
>
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index a20dc66b9946..b4511c606072 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -59,10 +59,15 @@ struct platform_msi_priv_data;
>  void __get_cached_msi_msg(struct msi_desc *entry, struct msi_msg *msg);
>  #ifdef CONFIG_GENERIC_MSI_IRQ
>  void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg);
> +bool device_has_managed_msi_irq(struct device *dev);

This led me to the following build warning:

In file included from ../drivers/iommu/irq_remapping.c:6:0:
../include/linux/msi.h:23:40: warning: 'struct device' declared inside
parameter list will not be visible outside of this definition or declaration

A forward declaration for struct device before the #ifdef would
fix this.

Regards,
Varad

>  #else
>  static inline void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
>  {
>  }
> +static inline bool device_has_managed_msi_irq(struct device *dev)
> +{
> +	return false;
> +}
>  #endif
>
>  typedef void (*irq_write_msi_msg_t)(struct msi_desc *desc,
> diff --git a/kernel/irq/msi.c b/kernel/irq/msi.c
> index 00f89d5bd6f6..167bc1d8bf4a 100644
> --- a/kernel/irq/msi.c
> +++ b/kernel/irq/msi.c
> @@ -71,6 +71,24 @@ void get_cached_msi_msg(unsigned int irq, struct msi_msg *msg)
>  }
>  EXPORT_SYMBOL_GPL(get_cached_msi_msg);
>
> +/**
> + * device_has_managed_msi_irq - Query if device has managed irq entry
> + * @dev: Pointer to the device for which we want to query
> + *
> + * Return true if there is managed irq vector allocated on this device
> + */
> +bool device_has_managed_msi_irq(struct device *dev)
> +{
> +	struct msi_desc *desc;
> +
> +	for_each_msi_entry(desc, dev) {
> +		if (desc->affinity && desc->affinity->is_managed)
> +			return true;
> +	}
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(device_has_managed_msi_irq);
> +
>  #ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN
>  static inline void irq_chip_write_msi_msg(struct irq_data *data,
>  					  struct msi_msg *msg)
>

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq
  2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
@ 2021-08-18 14:44 ` Ming Lei
  2021-08-23 17:17   ` Sagi Grimberg
  2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
  2021-08-19 22:38 ` [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  3 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

Retrieve this info via the newly added helper device_has_managed_msi_irq(),
so we can decide whether a hctx needs to be drained before all its CPUs go
offline.

Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-pci.c                     |  2 ++
 block/blk-mq-rdma.c                    |  7 +++++++
 block/blk-mq-virtio.c                  |  2 ++
 drivers/scsi/hisi_sas/hisi_sas_v2_hw.c |  1 +
 include/linux/blk-mq.h                 |  3 ++-
 5 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index b595a94c4d16..e452cda0896a 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -8,6 +8,7 @@
 #include <linux/blk-mq-pci.h>
 #include <linux/pci.h>
 #include <linux/module.h>
+#include <linux/msi.h>
 
 #include "blk-mq.h"
 
@@ -37,6 +38,7 @@ int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = device_has_managed_msi_irq(&pdev->dev);
 
 	return 0;
 
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 14f968e58b8f..19ad31c44eab 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -36,6 +36,13 @@ int blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
 			map->mq_map[cpu] = map->queue_offset + queue;
 	}
 
+	/*
+	 * RDMA doesn't use managed irq, and nvme rdma driver can allocate
+	 * and submit requests on specified hctx via
+	 * blk_mq_alloc_request_hctx
+	 */
+	map->use_managed_irq = false;
+
 	return 0;
 
 fallback:
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index 7b8a42c35102..2ce39fb77dce 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -7,6 +7,7 @@
 #include <linux/blk-mq-virtio.h>
 #include <linux/virtio_config.h>
 #include <linux/module.h>
+#include <linux/msi.h>
 #include "blk-mq.h"
 
 /**
@@ -38,6 +39,7 @@ int blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = device_has_managed_msi_irq(&vdev->dev);
 
 	return 0;
 fallback:
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
index b0b2361e63fe..7d7df261d346 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3562,6 +3562,7 @@ static int map_queues_v2_hw(struct Scsi_Host *shost)
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = true;
 
 	return 0;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 22215db36122..fd5540fba4ef 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -192,7 +192,8 @@ struct blk_mq_hw_ctx {
 struct blk_mq_queue_map {
 	unsigned int *mq_map;
 	unsigned int nr_queues;
-	unsigned int queue_offset;
+	unsigned int queue_offset:31;
+	unsigned int use_managed_irq:1;
 };
 
 /**
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq
  2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
@ 2021-08-23 17:17   ` Sagi Grimberg
  0 siblings, 0 replies; 13+ messages in thread
From: Sagi Grimberg @ 2021-08-23 17:17 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Daniel Wagner, Wen Xiong

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
  2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
@ 2021-08-18 14:44 ` Ming Lei
  2021-08-23 17:18   ` Sagi Grimberg
  2021-09-15 16:14   ` Daniel Wagner
  2021-08-19 22:38 ` [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  3 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2021-08-18 14:44 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong, Ming Lei

blk-mq deactivates a hctx when the last CPU in hctx->cpumask goes
offline, by draining all requests originating from this hctx and moving
new allocations to other active hctxs. This is done to avoid inflight IO
when managed irqs are used, because a managed irq is shut down when the
last CPU in its affinity mask goes offline.

However, lots of drivers (nvme fc, rdma, tcp, loop, ...) don't use
managed irqs, so they need not deactivate a hctx when its last CPU goes
offline. Also, some of them are the only users of
blk_mq_alloc_request_hctx(), which is used for connecting io queues, and
their requirement is that the connect request must be submitted
successfully via one specified hctx even though all CPUs in that hctx's
cpumask have become offline.

Address the requirement for nvme fc/rdma/loop by allowing request
allocation from a hctx when all of its CPUs are offline, since these
drivers don't use managed irqs. Finally, don't deactivate a hctx when it
doesn't use a managed irq.
Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ab4540320ca..9abfad3d5b48 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -427,6 +427,23 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 }
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
+static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+
+	if (cpu >= nr_cpu_ids)
+		cpu = cpumask_first(hctx->cpumask);
+	return cpu;
+}
+
+static bool blk_mq_hctx_use_managed_irq(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->type == HCTX_TYPE_POLL)
+		return false;
+
+	return hctx->queue->tag_set->map[hctx->type].use_managed_irq;
+}
+
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
@@ -468,7 +485,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	data.hctx = q->queue_hw_ctx[hctx_idx];
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
-	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+
+	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
+
+	cpu = blk_mq_first_mapped_cpu(data.hctx);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	if (!q->elevator)
@@ -1501,15 +1521,6 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	hctx_unlock(hctx, srcu_idx);
 }
 
-static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
-{
-	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
-
-	if (cpu >= nr_cpu_ids)
-		cpu = cpumask_first(hctx->cpumask);
-	return cpu;
-}
-
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -2556,6 +2567,10 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
 
+	/* hctx needn't to be deactivated in case managed irq isn't used */
+	if (!blk_mq_hctx_use_managed_irq(hctx))
+		return 0;
+
 	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
 	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
 		return 0;
-- 
2.31.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
@ 2021-08-23 17:18   ` Sagi Grimberg
  2021-09-15 16:14   ` Daniel Wagner
  1 sibling, 0 replies; 13+ messages in thread
From: Sagi Grimberg @ 2021-08-23 17:18 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
  Cc: John Garry, Daniel Wagner, Wen Xiong

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
  2021-08-23 17:18   ` Sagi Grimberg
@ 2021-09-15 16:14   ` Daniel Wagner
  2021-09-16  2:17     ` Ming Lei
  1 sibling, 1 reply; 13+ messages in thread
From: Daniel Wagner @ 2021-09-15 16:14 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong

On Wed, Aug 18, 2021 at 10:44:28PM +0800, Ming Lei wrote:
>  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
>  {
> @@ -468,7 +485,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  	data.hctx = q->queue_hw_ctx[hctx_idx];
>  	if (!blk_mq_hw_queue_mapped(data.hctx))
>  		goto out_queue_exit;
> -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> +
> +	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
> +
> +	cpu = blk_mq_first_mapped_cpu(data.hctx);
>  	data.ctx = __blk_mq_get_ctx(q, cpu);

I was pondering how we could address the issue that the qla2xxx driver
is using managed IRQs, which makes nvme-fc as a class depend on managed
IRQs.

blk_mq_alloc_request_hctx() is the only place where we really need to
distinguish between managed and !managed IRQs. As far as I understand
the situation, if all CPUs for a hctx go offline, the driver won't use
this context. So the only way we end up in this code path is when the
driver tries to reconnect the queues, e.g. after devloss. Couldn't we in
this case just return an error and go into error recovery?
Something like this:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a2db50886a26..52fc8592c72e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -486,9 +486,13 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
 
-	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
-
-	cpu = blk_mq_first_mapped_cpu(data.hctx);
+	if (blk_mq_hctx_use_managed_irq(data.hctx)) {
+		cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+		if (cpu >= nr_cpu_ids)
+			return ERR_PTR(-EINVAL);
+	} else {
+		cpu = blk_mq_first_mapped_cpu(data.hctx);
+	}
 	data.ctx = __blk_mq_get_ctx(q, cpu);

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-09-15 16:14   ` Daniel Wagner
@ 2021-09-16  2:17     ` Ming Lei
  2021-09-16  7:42       ` Daniel Wagner
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-09-16 2:17 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart

On Wed, Sep 15, 2021 at 06:14:59PM +0200, Daniel Wagner wrote:
> On Wed, Aug 18, 2021 at 10:44:28PM +0800, Ming Lei wrote:
> >  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >  	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
> >  {
> > @@ -468,7 +485,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >  	data.hctx = q->queue_hw_ctx[hctx_idx];
> >  	if (!blk_mq_hw_queue_mapped(data.hctx))
> >  		goto out_queue_exit;
> > -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> > +
> > +	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
> > +
> > +	cpu = blk_mq_first_mapped_cpu(data.hctx);
> >  	data.ctx = __blk_mq_get_ctx(q, cpu);
>
> I was pondering how we could address the issue that the qla2xxx driver
> is using managed IRQs which makes nvme-fc depending as class on managed
> IRQ.
>
> blk_mq_alloc_request_hctx() is the only place where we really need to
> distinguish between managed and !managed IRQs. As far I undertand the
> situation, if all CPUs for a hctx are going offline, the driver wont use
> this context. So there is only the case we end up in this code path is
> when the driver tries to reconnect the queues, e.g. after
> devloss. Couldn't we in this case not just return an error and go into
> error recovery? Something like this:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a2db50886a26..52fc8592c72e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -486,9 +486,13 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  	if (!blk_mq_hw_queue_mapped(data.hctx))
>  		goto out_queue_exit;
>  
> -	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
> -
> -	cpu = blk_mq_first_mapped_cpu(data.hctx);
> +	if (blk_mq_hctx_use_managed_irq(data.hctx)) {
> +		cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
> +		if (cpu >= nr_cpu_ids)
> +			return ERR_PTR(-EINVAL);
> +	} else {
> +		cpu = blk_mq_first_mapped_cpu(data.hctx);
> +	}
>  	data.ctx = __blk_mq_get_ctx(q, cpu);

Firstly, even with the patches of 'qla2xxx - add nvme map_queues support',
the knowledge of whether a managed irq is used is still missing in the
nvmef LLD, so blk_mq_hctx_use_managed_irq() may always return false, but
that shouldn't be hard to solve.

The problem is that we should still make the connect io queue command
complete even when all CPUs of this hctx are offline in case of managed
irq.

One solution might be to use io polling for connecting the io queue, but
nvme fc doesn't support polling; all the other nvme hosts do support it.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-09-16  2:17     ` Ming Lei
@ 2021-09-16  7:42       ` Daniel Wagner
  2021-09-16  8:13         ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Daniel Wagner @ 2021-09-16 7:42 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart

On Thu, Sep 16, 2021 at 10:17:18AM +0800, Ming Lei wrote:
> Firstly, even with patches of 'qla2xxx - add nvme map_queues support',
> the knowledge if managed irq is used in nvmef LLD is still missed, so
> blk_mq_hctx_use_managed_irq() may always return false, but that
> shouldn't be hard to solve.

Yes, that's pretty simple:

--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -7914,6 +7914,9 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
 		rc = blk_mq_map_queues(qmap);
 	else
 		rc = blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
+
+	qmap->use_managed_irq = true;
+
 	return rc;
 }

> The problem is that we still should make connect io queue completed
> when all CPUs of this hctx is offline in case of managed irq.

I agree, though if I understand this right, the scenario where all CPUs
of a hctx are offline and we still want to use this hctx only happens
after an initial setup, when a reconnect attempt is made. That is,
during the first connect attempt only online CPUs are assigned to the
hctx. When the CPUs are taken offline, the block layer makes sure not to
use those queues anymore (no problem for the hctx so far). Then for some
reason the nvme-fc layer decides to reconnect and we end up in the
situation where we don't have any online CPU in the given hctx.

> One solution might be to use io polling for connecting io queue, but nvme fc
> doesn't support polling, all the other nvme hosts do support it.

No idea, something to explore for sure :) My point is that your series
fixes existing bugs and doesn't introduce a new one. qla2xxx already
depends on managed IRQs. I would like to see your series accepted, with
my hack as a stop-gap solution until we have a proper fix.

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-09-16  7:42       ` Daniel Wagner
@ 2021-09-16  8:13         ` Ming Lei
  2021-10-04 12:25           ` [RFC] nvme-fc: Allow managed IRQs Daniel Wagner
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2021-09-16 8:13 UTC (permalink / raw)
  To: Daniel Wagner
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart

On Thu, Sep 16, 2021 at 09:42:29AM +0200, Daniel Wagner wrote:
> On Thu, Sep 16, 2021 at 10:17:18AM +0800, Ming Lei wrote:
> > Firstly, even with patches of 'qla2xxx - add nvme map_queues support',
> > the knowledge if managed irq is used in nvmef LLD is still missed, so
> > blk_mq_hctx_use_managed_irq() may always return false, but that
> > shouldn't be hard to solve.
>
> Yes, that's pretty simple:
>
> --- a/drivers/scsi/qla2xxx/qla_os.c
> +++ b/drivers/scsi/qla2xxx/qla_os.c
> @@ -7914,6 +7914,9 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
>  		rc = blk_mq_map_queues(qmap);
>  	else
>  		rc = blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
> +
> +	qmap->use_managed_irq = true;
> +
>  	return rc;
> }

blk_mq_alloc_request_hctx() isn't called on the qla2xxx queue; what we
need is to mark the nvmef queue with .use_managed_irq if the LLD uses
managed irqs.

> > The problem is that we still should make connect io queue completed
> > when all CPUs of this hctx is offline in case of managed irq.
>
> I agree, though if I understand this right, the scenario where all CPUs
> are offline in a hctx and we want to use this htcx is only happening
> after an initial setup and then reconnect attempt happens. That is
> during the first connect attempt only online CPUs are assigned to the
> hctx. When the CPUs are taken offline the block layer makes sure not to
> use those queues anymore (no problem for the hctx so far). Then for some
> reason the nmve-fc layer decides to reconnect and we end up in the
> situation where we don't have any online CPU in given hctx.

It is simply that blk_mq_alloc_request_hctx() allocates a request from
one specified hctx, and that specified hctx can go offline at any time.

> > One solution might be to use io polling for connecting io queue, but nvme fc
> > doesn't support polling, all the other nvme hosts do support it.
>
> No idea, something to explore for sure :)

It is a totally raw idea, something like: start each queue in poll mode,
and run the connect io queue command via polling. Once the connect io
queue command is done, switch the queue into normal mode. Then
blk_mq_alloc_request_hctx() is guaranteed to be successful.

> My point is that your series is fixing existing bugs and doesn't
> introduce a new one. qla2xxx is already depending on managed IRQs. I
> would like to see your series accepted with my hack as stop gap solution
> until we have a proper fix.

I am fine with working this way first if no one objects.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 13+ messages in thread
* [RFC] nvme-fc: Allow managed IRQs
  2021-09-16  8:13         ` Ming Lei
@ 2021-10-04 12:25           ` Daniel Wagner
  0 siblings, 0 replies; 13+ messages in thread
From: Daniel Wagner @ 2021-10-04 12:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner,
	John Garry, Sagi Grimberg, Wen Xiong, James Smart,
	Hannes Reinecke, Daniel Wagner

nvme-fc is currently the only user of blk_mq_alloc_request_hctx(). With
the recent changes to teach the nvme subsystem to honor managed IRQs,
the assumption was that nvme-fc as a whole doesn't use managed IRQs.
Unfortunately, the qla2xxx driver does use managed IRQs.

Add an interface for the nvme-fc drivers to update the mapping and also
to set the use_managed_irq flag. This is very ugly as we have to pass
down struct blk_mq_tag_set. I haven't found any better way so far.

Relax the requirement in blk_mq_alloc_request_hctx() that only !managed
IRQs are supported. As long as one CPU is online in the requested hctx,
all is good. If this is not the case, we return an error, which allows
the upper layer to start the reconnect loop.

As the current qla2xxx already depends on managed IRQs, the main
difference with and without this patch is that we see

  nvme nvme8: Connect command failed, error wo/DNR bit: -16402
  nvme nvme8: NVME-FC{8}: reset: Reconnect attempt failed (-18)

instead of just timeouts such as

  qla2xxx [0000:81:00.0]-5032:1: ABT_IOCB: Invalid completion handle (1da) -- timed-out.

In both cases the system recovers as soon as at least one CPU is online
in every hctx.

Also note, this only affects admin requests. As long as no FC reset
happens and no reconnect attempt is triggered, user space is able to
issue I/Os to the target.

Signed-off-by: Daniel Wagner <dwagner@suse.de>
---
Hi,

I've played a bit with this patch to figure out what the impact is for
the qla2xxx driver. Basically, the situation doesn't change a lot with
Ming's patches.
If we happen to run into the situation that all CPUs are offline in one
hctx and a reconnect attempt is triggered, all traffic to the target
ceases. But as soon as we have at least one CPU online in every hctx, the
system recovers. This patch just makes it a bit more verbose (maybe a
warning could be added to blk_mq_alloc_request_hctx()).

Thanks,
Daniel

 block/blk-mq.c                  | 10 +++++++---
 drivers/nvme/host/fc.c          | 13 +++++++++++++
 drivers/scsi/qla2xxx/qla_nvme.c | 14 ++++++++++++++
 drivers/scsi/qla2xxx/qla_os.c   |  3 +++
 include/linux/nvme-fc-driver.h  |  4 ++++
 5 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index a2db50886a26..df65690bf649 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -486,9 +486,13 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
 
-	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
-
-	cpu = blk_mq_first_mapped_cpu(data.hctx);
+	if (blk_mq_hctx_use_managed_irq(data.hctx)) {
+		cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+		if (cpu >= nr_cpu_ids)
+			return ERR_PTR(ret);
+	} else {
+		cpu = blk_mq_first_mapped_cpu(data.hctx);
+	}
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	if (!q->elevator)
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index aa14ad963d91..001ba8f0776b 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2841,6 +2841,17 @@ nvme_fc_complete_rq(struct request *rq)
 	nvme_fc_ctrl_put(ctrl);
 }
 
+static int
+nvme_fc_map_queues(struct blk_mq_tag_set *set)
+{
+	struct nvme_fc_ctrl *ctrl = set->driver_data;
+
+	if (ctrl->lport->ops->map_queues)
+		return ctrl->lport->ops->map_queues(&ctrl->lport->localport, set);
+
+	return blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
+}
+
 static const struct blk_mq_ops nvme_fc_mq_ops = {
 	.queue_rq	= nvme_fc_queue_rq,
@@ -2849,6 +2860,7 @@ static const struct blk_mq_ops nvme_fc_mq_ops = {
 	.exit_request	= nvme_fc_exit_request,
 	.init_hctx	= nvme_fc_init_hctx,
 	.timeout	= nvme_fc_timeout,
+	.map_queues	= nvme_fc_map_queues,
 };
 
 static int
@@ -3392,6 +3404,7 @@ static const struct blk_mq_ops nvme_fc_admin_mq_ops = {
 	.exit_request	= nvme_fc_exit_request,
 	.init_hctx	= nvme_fc_init_admin_hctx,
 	.timeout	= nvme_fc_timeout,
+	.map_queues	= nvme_fc_map_queues,
 };
 
diff --git a/drivers/scsi/qla2xxx/qla_nvme.c b/drivers/scsi/qla2xxx/qla_nvme.c
index 1c5da2dbd6f9..b7681f66d05b 100644
--- a/drivers/scsi/qla2xxx/qla_nvme.c
+++ b/drivers/scsi/qla2xxx/qla_nvme.c
@@ -667,6 +667,19 @@ static void qla_nvme_remoteport_delete(struct nvme_fc_remote_port *rport)
 	complete(&fcport->nvme_del_done);
 }
 
+static int qla_nvme_map_queues(struct nvme_fc_local_port *lport,
+		struct blk_mq_tag_set *set)
+{
+
+	struct blk_mq_queue_map *qmap = &set->map[HCTX_TYPE_DEFAULT];
+	int ret;
+
+	ret = blk_mq_map_queues(qmap);
+	qmap->use_managed_irq = true;
+
+	return ret;
+}
+
 static struct nvme_fc_port_template qla_nvme_fc_transport = {
 	.localport_delete = qla_nvme_localport_delete,
 	.remoteport_delete = qla_nvme_remoteport_delete,
@@ -676,6 +689,7 @@ static struct nvme_fc_port_template qla_nvme_fc_transport = {
 	.ls_abort	= qla_nvme_ls_abort,
 	.fcp_io		= qla_nvme_post_cmd,
 	.fcp_abort	= qla_nvme_fcp_abort,
+	.map_queues	= qla_nvme_map_queues,
 	.max_hw_queues	= 8,
 	.max_sgl_segments = 1024,
 	.max_dif_sgl_segments = 64,
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index d2e40aaba734..188ce9b7f407 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -7914,6 +7914,9 @@ static int qla2xxx_map_queues(struct Scsi_Host *shost)
 		rc = blk_mq_map_queues(qmap);
 	else
 		rc = blk_mq_pci_map_queues(qmap, vha->hw->pdev, vha->irq_offset);
+
+	qmap->use_managed_irq = true;
+
 	return rc;
 }
 
diff --git a/include/linux/nvme-fc-driver.h b/include/linux/nvme-fc-driver.h
index 2a38f2b477a5..afae40d8c347 100644
--- a/include/linux/nvme-fc-driver.h
+++ b/include/linux/nvme-fc-driver.h
@@ -471,6 +471,8 @@ struct nvme_fc_remote_port {
  *       specified by the fcp_request->private pointer.
  *       Value is Mandatory. Allowed to be zero.
  */
+struct blk_mq_tag_set;
+
 struct nvme_fc_port_template {
 	/* initiator-based functions */
 	void	(*localport_delete)(struct nvme_fc_local_port *);
@@ -497,6 +499,8 @@ struct nvme_fc_port_template {
 	int	(*xmt_ls_rsp)(struct nvme_fc_local_port *localport,
 				struct nvme_fc_remote_port *rport,
 				struct nvmefc_ls_rsp *ls_rsp);
+	int	(*map_queues)(struct nvme_fc_local_port *localport,
+				struct blk_mq_tag_set *set);
 
 	u32	max_hw_queues;
 	u16	max_sgl_segments;
-- 
2.29.2

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx
  2021-08-19 22:38 ` Ming Lei

From: Ming Lei @ 2021-08-19 22:38 UTC (permalink / raw)
To: Jens Axboe, Christoph Hellwig, linux-block, Thomas Gleixner
Cc: John Garry, Sagi Grimberg, Daniel Wagner, Wen Xiong

On Wed, Aug 18, 2021 at 10:44:25PM +0800, Ming Lei wrote:
> Hi,
>
> blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
> io queue. Also the sw ctx is chosen as the 1st online cpu in
> hctx->cpumask. However, all cpus in hctx->cpumask may be offline.
>
> This usage model isn't well supported by blk-mq which supposes allocator
> is always done on one online CPU in hctx->cpumask. This assumption is
> related with managed irq, which also requires blk-mq to drain inflight
> request in this hctx when the last cpu in hctx->cpumask is going to
> offline.
>
> However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
> them to ask for request allocation when the specified hctx is inactive
> (all cpus in hctx->cpumask are offline). Fix blk_mq_alloc_request_hctx()
> by allowing to allocate request when all CPUs of this hctx are offline.
>
> Wen Xiong has verified V4 in her nvmef test.
>
> V7:
> 	- move blk_mq_hctx_use_managed_irq() into block/blk-mq.c, 3/3

Hello Jens,

NVMe TCP and the other fabrics transports have seen a lot of use
recently, and the kernel panic in blk_mq_alloc_request_hctx() has
annoyed people for quite a while. Any chance to pull the three patches
in so we can fix this in 5.15?

Thanks,
Ming
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --

2021-08-18 14:44 [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
2021-08-18 14:44 ` [PATCH V7 1/3] genirq: add device_has_managed_msi_irq Ming Lei
2021-10-11 18:23   ` Varad Gautam
2021-08-18 14:44 ` [PATCH V7 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
2021-08-23 17:17   ` Sagi Grimberg
2021-08-18 14:44 ` [PATCH V7 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
2021-08-23 17:18   ` Sagi Grimberg
2021-09-15 16:14   ` Daniel Wagner
2021-09-16  2:17     ` Ming Lei
2021-09-16  7:42       ` Daniel Wagner
2021-09-16  8:13         ` Ming Lei
2021-10-04 12:25   ` [RFC] nvme-fc: Allow managed IRQs Daniel Wagner
2021-08-19 22:38 ` [PATCH V7 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei