linux-block.vger.kernel.org archive mirror
* [PATCH V4 0/3] blk-mq: fix blk_mq_alloc_request_hctx
@ 2021-07-15 12:08 Ming Lei
  2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Ming Lei @ 2021-07-15 12:08 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci
  Cc: Thomas Gleixner, Sagi Grimberg, Daniel Wagner, Wen Xiong,
	John Garry, Hannes Reinecke, Keith Busch, Ming Lei

Hi,

blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
io queues, and the sw ctx is chosen as the one mapped from the 1st
online CPU in hctx->cpumask. However, all CPUs in hctx->cpumask may be
offline.

This usage model isn't well supported by blk-mq, which assumes that
allocation is always done on an online CPU in hctx->cpumask. This
assumption is tied to managed irqs, which also require blk-mq to drain
inflight requests in this hctx when the last CPU in hctx->cpumask is
going offline.

However, NVMe fc/rdma/tcp/loop don't use managed irqs, so we should
allow them to ask for request allocation when the specified hctx is
inactive (all CPUs in hctx->cpumask are offline). Fix
blk_mq_alloc_request_hctx() by allowing request allocation when all
CPUs of this hctx are offline.
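
For illustration, a minimal sketch of how a fabrics driver reaches this
path (a simplification, not the literal NVMe code; 'ctrl->connect_q'
and 'qid' are assumed names, with io queue qid mapping to hctx index
qid - 1):

struct request *rq;

/* Submit the "connect" command on one specific hctx. */
rq = blk_mq_alloc_request_hctx(ctrl->connect_q, REQ_OP_DRV_OUT,
			       BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED,
			       qid - 1);
if (IS_ERR(rq))
	return PTR_ERR(rq);
/* Before this series, this failed whenever all CPUs in the hctx's
 * cpumask were offline. */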


V4:
	- remove patches for cleanup queue map helpers
	- take Christoph's suggestion to add field into 'struct device' for
	describing if managed irq is allocated from one device

V3:
	- cleanup map queues helpers, and remove pci/virtio/rdma queue
	  helpers
	- store use managed irq info into qmap


V2:
	- use flag of BLK_MQ_F_MANAGED_IRQ
	- pass BLK_MQ_F_MANAGED_IRQ from driver explicitly
	- kill BLK_MQ_F_STACKING


Ming Lei (3):
  driver core: mark device as irq affinity managed if any irq is managed
  blk-mq: mark if one queue map uses managed irq
  blk-mq: don't deactivate hctx if managed irq isn't used

 block/blk-mq-pci.c      |  1 +
 block/blk-mq-rdma.c     |  3 +++
 block/blk-mq-virtio.c   |  1 +
 block/blk-mq.c          | 27 +++++++++++++++++----------
 block/blk-mq.h          |  8 ++++++++
 drivers/base/platform.c |  7 +++++++
 drivers/pci/msi.c       |  3 +++
 include/linux/blk-mq.h  |  3 ++-
 include/linux/device.h  |  1 +
 9 files changed, 43 insertions(+), 11 deletions(-)

-- 
2.31.1



* [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-15 12:08 [PATCH V4 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
@ 2021-07-15 12:08 ` Ming Lei
  2021-07-15 12:40   ` Greg Kroah-Hartman
                     ` (2 more replies)
  2021-07-15 12:08 ` [PATCH V4 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
  2021-07-15 12:08 ` [PATCH V4 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
  2 siblings, 3 replies; 22+ messages in thread
From: Ming Lei @ 2021-07-15 12:08 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci
  Cc: Thomas Gleixner, Sagi Grimberg, Daniel Wagner, Wen Xiong,
	John Garry, Hannes Reinecke, Keith Busch, Ming Lei

irq vector allocation with managed affinity may be used by driver, and
blk-mq needs this info because managed irq will be shutdown when all
CPUs in the affinity mask are offline.

The info of using managed irq is often produced by drivers(pci subsystem,
platform device, ...), and it is consumed by blk-mq, so different subsystems
are involved in this info flow

Address this issue by adding one field of .irq_affinity_managed into
'struct device'.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/base/platform.c | 7 +++++++
 drivers/pci/msi.c       | 3 +++
 include/linux/device.h  | 1 +
 3 files changed, 11 insertions(+)

diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index 8640578f45e9..d28cb91d5cf9 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
 				ptr->irq[i], ret);
 			goto err_free_desc;
 		}
+
+		/*
+		 * mark the device as irq affinity managed if any irq affinity
+		 * descriptor is managed
+		 */
+		if (desc[i].is_managed)
+			dev->dev.irq_affinity_managed = true;
 	}
 
 	devres_add(&dev->dev, ptr);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 3d6db20d1b2b..7ddec90b711d 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
 			affd = &msi_default_affd;
+		dev->dev.irq_affinity_managed = true;
 	} else {
 		if (WARN_ON(affd))
 			affd = NULL;
@@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 			return nvecs;
 	}
 
+	dev->dev.irq_affinity_managed = false;
+
 	/* use legacy IRQ if allowed */
 	if (flags & PCI_IRQ_LEGACY) {
 		if (min_vecs == 1 && dev->irq) {
diff --git a/include/linux/device.h b/include/linux/device.h
index 59940f1744c1..9ec6e671279e 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -569,6 +569,7 @@ struct device {
 #ifdef CONFIG_DMA_OPS_BYPASS
 	bool			dma_ops_bypass : 1;
 #endif
+	bool			irq_affinity_managed : 1;
 };
 
 /**
-- 
2.31.1



* [PATCH V4 2/3] blk-mq: mark if one queue map uses managed irq
  2021-07-15 12:08 [PATCH V4 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
@ 2021-07-15 12:08 ` Ming Lei
  2021-07-15 12:08 ` [PATCH V4 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
  2 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2021-07-15 12:08 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci
  Cc: Thomas Gleixner, Sagi Grimberg, Daniel Wagner, Wen Xiong,
	John Garry, Hannes Reinecke, Keith Busch, Ming Lei

Retrieve this info via the field 'irq_affinity_managed' of 'struct
device' in queue map helpers.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-pci.c     | 1 +
 block/blk-mq-rdma.c    | 3 +++
 block/blk-mq-virtio.c  | 1 +
 include/linux/blk-mq.h | 3 ++-
 4 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-pci.c b/block/blk-mq-pci.c
index b595a94c4d16..aa0bdb80d0ce 100644
--- a/block/blk-mq-pci.c
+++ b/block/blk-mq-pci.c
@@ -37,6 +37,7 @@ int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = pdev->dev.irq_affinity_managed;
 
 	return 0;
 
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 14f968e58b8f..7b10d8bd2a37 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -36,6 +36,9 @@ int blk_mq_rdma_map_queues(struct blk_mq_queue_map *map,
 			map->mq_map[cpu] = map->queue_offset + queue;
 	}
 
+	/* So far RDMA doesn't use managed irq */
+	map->use_managed_irq = false;
+
 	return 0;
 
 fallback:
diff --git a/block/blk-mq-virtio.c b/block/blk-mq-virtio.c
index 7b8a42c35102..b57a0aa6d900 100644
--- a/block/blk-mq-virtio.c
+++ b/block/blk-mq-virtio.c
@@ -38,6 +38,7 @@ int blk_mq_virtio_map_queues(struct blk_mq_queue_map *qmap,
 		for_each_cpu(cpu, mask)
 			qmap->mq_map[cpu] = qmap->queue_offset + queue;
 	}
+	qmap->use_managed_irq = vdev->dev.irq_affinity_managed;
 
 	return 0;
 fallback:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1d18447ebebc..d54a795ec971 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -192,7 +192,8 @@ struct blk_mq_hw_ctx {
 struct blk_mq_queue_map {
 	unsigned int *mq_map;
 	unsigned int nr_queues;
-	unsigned int queue_offset;
+	unsigned int queue_offset:31;
+	unsigned int use_managed_irq:1;
 };
 
 /**
-- 
2.31.1



* [PATCH V4 3/3] blk-mq: don't deactivate hctx if managed irq isn't used
  2021-07-15 12:08 [PATCH V4 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
  2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
  2021-07-15 12:08 ` [PATCH V4 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
@ 2021-07-15 12:08 ` Ming Lei
  2 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2021-07-15 12:08 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci
  Cc: Thomas Gleixner, Sagi Grimberg, Daniel Wagner, Wen Xiong,
	John Garry, Hannes Reinecke, Keith Busch, Ming Lei

blk-mq deactivates one hctx when the last CPU in hctx->cpumask becomes
offline, by draining all requests originating from this hctx and moving
new allocations to other active hctxs. This is done to avoid inflight IO
when managed irqs are used, because a managed irq is shut down when the
last CPU in the irq's affinity mask becomes offline.

However, lots of drivers (nvme fc, rdma, tcp, loop, ...) don't use
managed irqs, so they needn't deactivate the hctx when the last CPU
becomes offline. Also, some of them are the only users of
blk_mq_alloc_request_hctx(), which is used for connecting io queues,
and their requirement is that the connect request must be submitted
successfully via one specified hctx even though all CPUs in this
hctx->cpumask have become offline.

Address the requirement for nvme fc/rdma/loop by allowing request
allocation from one hctx when all CPUs in this hctx are offline, since
these drivers don't use managed irqs.

Finally, don't deactivate one hctx when it doesn't use a managed irq.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 27 +++++++++++++++++----------
 block/blk-mq.h |  8 ++++++++
 2 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2c4ac51e54eb..591ab07c64d8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -427,6 +427,15 @@ struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
 }
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
+static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
+
+	if (cpu >= nr_cpu_ids)
+		cpu = cpumask_first(hctx->cpumask);
+	return cpu;
+}
+
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
@@ -468,7 +477,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	data.hctx = q->queue_hw_ctx[hctx_idx];
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
-	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+
+	WARN_ON_ONCE(blk_mq_hctx_use_managed_irq(data.hctx));
+
+	cpu = blk_mq_first_mapped_cpu(data.hctx);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	if (!q->elevator)
@@ -1501,15 +1513,6 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	hctx_unlock(hctx, srcu_idx);
 }
 
-static inline int blk_mq_first_mapped_cpu(struct blk_mq_hw_ctx *hctx)
-{
-	int cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
-
-	if (cpu >= nr_cpu_ids)
-		cpu = cpumask_first(hctx->cpumask);
-	return cpu;
-}
-
 /*
  * It'd be great if the workqueue API had a way to pass
  * in a mask and had some smarts for more clever placement.
@@ -2556,6 +2559,10 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
 			struct blk_mq_hw_ctx, cpuhp_online);
 
+	/* hctx needn't to be deactivated in case managed irq isn't used */
+	if (!blk_mq_hctx_use_managed_irq(hctx))
+		return 0;
+
 	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
 	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
 		return 0;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index d08779f77a26..7333b659d8f5 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -119,6 +119,14 @@ static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
 	return ctx->hctxs[type];
 }
 
+static inline bool blk_mq_hctx_use_managed_irq(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->type == HCTX_TYPE_POLL)
+		return false;
+
+	return hctx->queue->tag_set->map[hctx->type].use_managed_irq;
+}
+
 /*
  * sysfs helpers
  */
-- 
2.31.1



* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
@ 2021-07-15 12:40   ` Greg Kroah-Hartman
  2021-07-16  2:17     ` Ming Lei
  2021-07-16 20:01   ` Bjorn Helgaas
  2021-07-19  7:51   ` John Garry
  2 siblings, 1 reply; 22+ messages in thread
From: Greg Kroah-Hartman @ 2021-07-15 12:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Bjorn Helgaas, linux-pci, Thomas Gleixner, Sagi Grimberg,
	Daniel Wagner, Wen Xiong, John Garry, Hannes Reinecke,
	Keith Busch

On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -569,6 +569,7 @@ struct device {
>  #ifdef CONFIG_DMA_OPS_BYPASS
>  	bool			dma_ops_bypass : 1;
>  #endif
> +	bool			irq_affinity_managed : 1;
>  };

No documentation for this new field?

:(


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-15 12:40   ` Greg Kroah-Hartman
@ 2021-07-16  2:17     ` Ming Lei
  0 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2021-07-16  2:17 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Bjorn Helgaas, linux-pci, Thomas Gleixner, Sagi Grimberg,
	Daniel Wagner, Wen Xiong, John Garry, Hannes Reinecke,
	Keith Busch

On Thu, Jul 15, 2021 at 02:40:55PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -569,6 +569,7 @@ struct device {
> >  #ifdef CONFIG_DMA_OPS_BYPASS
> >  	bool			dma_ops_bypass : 1;
> >  #endif
> > +	bool			irq_affinity_managed : 1;
> >  };
> 
> No documentation for this new field?

OK, will fix it in next version.


Thanks,
Ming



* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
  2021-07-15 12:40   ` Greg Kroah-Hartman
@ 2021-07-16 20:01   ` Bjorn Helgaas
  2021-07-17  9:30     ` Ming Lei
  2021-07-19  7:51   ` John Garry
  2 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2021-07-16 20:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Thomas Gleixner,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry,
	Hannes Reinecke, Keith Busch

On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> irq vector allocation with managed affinity may be used by driver, and
> blk-mq needs this info because managed irq will be shutdown when all
> CPUs in the affinity mask are offline.
> 
> The info of using managed irq is often produced by drivers(pci subsystem,

Add space between "drivers" and "(".
s/pci/PCI/

Does this "managed IRQ" (or "managed affinity", not sure what the
correct terminology is here) have something to do with devm?

> platform device, ...), and it is consumed by blk-mq, so different subsystems
> are involved in this info flow

Add period at end of sentence.

> Address this issue by adding one field of .irq_affinity_managed into
> 'struct device'.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/base/platform.c | 7 +++++++
>  drivers/pci/msi.c       | 3 +++
>  include/linux/device.h  | 1 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 8640578f45e9..d28cb91d5cf9 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
>  				ptr->irq[i], ret);
>  			goto err_free_desc;
>  		}
> +
> +		/*
> +		 * mark the device as irq affinity managed if any irq affinity
> +		 * descriptor is managed
> +		 */
> +		if (desc[i].is_managed)
> +			dev->dev.irq_affinity_managed = true;
>  	}
>  
>  	devres_add(&dev->dev, ptr);
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3d6db20d1b2b..7ddec90b711d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> +		dev->dev.irq_affinity_managed = true;

This is really opaque to me.  I can't tell what the connection between
PCI_IRQ_AFFINITY and irq_affinity_managed is.

AFAICT the only place irq_affinity_managed is ultimately used is
blk_mq_hctx_notify_offline(), and there's no obvious connection
between that and this code.

>  	} else {
>  		if (WARN_ON(affd))
>  			affd = NULL;
> @@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  			return nvecs;
>  	}
>  
> +	dev->dev.irq_affinity_managed = false;
> +
>  	/* use legacy IRQ if allowed */
>  	if (flags & PCI_IRQ_LEGACY) {
>  		if (min_vecs == 1 && dev->irq) {
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 59940f1744c1..9ec6e671279e 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -569,6 +569,7 @@ struct device {
>  #ifdef CONFIG_DMA_OPS_BYPASS
>  	bool			dma_ops_bypass : 1;
>  #endif
> +	bool			irq_affinity_managed : 1;
>  };
>  
>  /**
> -- 
> 2.31.1
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-16 20:01   ` Bjorn Helgaas
@ 2021-07-17  9:30     ` Ming Lei
  2021-07-21  0:30       ` Bjorn Helgaas
  0 siblings, 1 reply; 22+ messages in thread
From: Ming Lei @ 2021-07-17  9:30 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Thomas Gleixner,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry,
	Hannes Reinecke, Keith Busch

On Fri, Jul 16, 2021 at 03:01:54PM -0500, Bjorn Helgaas wrote:
> On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> > irq vector allocation with managed affinity may be used by driver, and
> > blk-mq needs this info because managed irq will be shutdown when all
> > CPUs in the affinity mask are offline.
> > 
> > The info of using managed irq is often produced by drivers(pci subsystem,
> 
> Add space between "drivers" and "(".
> s/pci/PCI/

OK.

> 
> Does this "managed IRQ" (or "managed affinity", not sure what the
> correct terminology is here) have something to do with devm?
> 
> > platform device, ...), and it is consumed by blk-mq, so different subsystems
> > are involved in this info flow
> 
> Add period at end of sentence.

OK.

> 
> > Address this issue by adding one field of .irq_affinity_managed into
> > 'struct device'.
> > 
> > Suggested-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/base/platform.c | 7 +++++++
> >  drivers/pci/msi.c       | 3 +++
> >  include/linux/device.h  | 1 +
> >  3 files changed, 11 insertions(+)
> > 
> > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > index 8640578f45e9..d28cb91d5cf9 100644
> > --- a/drivers/base/platform.c
> > +++ b/drivers/base/platform.c
> > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
> >  				ptr->irq[i], ret);
> >  			goto err_free_desc;
> >  		}
> > +
> > +		/*
> > +		 * mark the device as irq affinity managed if any irq affinity
> > +		 * descriptor is managed
> > +		 */
> > +		if (desc[i].is_managed)
> > +			dev->dev.irq_affinity_managed = true;
> >  	}
> >  
> >  	devres_add(&dev->dev, ptr);
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > index 3d6db20d1b2b..7ddec90b711d 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> >  	if (flags & PCI_IRQ_AFFINITY) {
> >  		if (!affd)
> >  			affd = &msi_default_affd;
> > +		dev->dev.irq_affinity_managed = true;
> 
> This is really opaque to me.  I can't tell what the connection between
> PCI_IRQ_AFFINITY and irq_affinity_managed is.

The comment for PCI_IRQ_AFFINITY is 'Auto-assign affinity', and
'irq_affinity_managed' basically means that the irq's affinity is
managed by the kernel.

What blk-mq needs is exactly whether PCI_IRQ_AFFINITY was applied when
allocating irq vectors. When PCI_IRQ_AFFINITY is used, genirq will
shut down the irq when all CPUs in the assigned affinity are offline,
and then blk-mq has to drain all in-flight IOs which will be completed
via this irq and prevent new IO. That is the connection.

Or do you think 'irq_affinity_managed' isn't named well?
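
For illustration, this is roughly how a PCI driver opts in (a sketch
only; the '.pre_vectors = 1' admin-vector layout mirrors what nvme-pci
does, details simplified):

struct irq_affinity affd = {
	.pre_vectors = 1,	/* keep one unmanaged vector, e.g. admin queue */
};
int nvecs;

/* PCI_IRQ_AFFINITY spreads the remaining vectors across the CPUs and
 * marks them as kernel-managed; with this patch it also sets
 * dev->dev.irq_affinity_managed. */
nvecs = pci_alloc_irq_vectors_affinity(pdev, 1, max_vecs,
		PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
if (nvecs < 0)
	return nvecs;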

> 
> AFAICT the only place irq_affinity_managed is ultimately used is
> blk_mq_hctx_notify_offline(), and there's no obvious connection
> between that and this code.

I believe the connection is described in the comment.


Thanks, 
Ming



* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
  2021-07-15 12:40   ` Greg Kroah-Hartman
  2021-07-16 20:01   ` Bjorn Helgaas
@ 2021-07-19  7:51   ` John Garry
  2021-07-19  9:44     ` Christoph Hellwig
  2 siblings, 1 reply; 22+ messages in thread
From: John Garry @ 2021-07-19  7:51 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci
  Cc: Thomas Gleixner, Sagi Grimberg, Daniel Wagner, Wen Xiong,
	Hannes Reinecke, Keith Busch

On 15/07/2021 13:08, Ming Lei wrote:
> irq vector allocation with managed affinity may be used by driver, and
> blk-mq needs this info because managed irq will be shutdown when all
> CPUs in the affinity mask are offline.
> 
> The info of using managed irq is often produced by drivers(pci subsystem,
> platform device, ...), and it is consumed by blk-mq, so different subsystems
> are involved in this info flow
> 
> Address this issue by adding one field of .irq_affinity_managed into
> 'struct device'.
> 
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Did you consider that for PCI device we effectively have this info already:

bool dev_has_managed_msi_irq(struct device *dev)
{
	struct msi_desc *desc;

	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
		if (desc->affinity && desc->affinity->is_managed)
			return true;
	}

	return false;
}

Thanks,
John

> ---
>   drivers/base/platform.c | 7 +++++++
>   drivers/pci/msi.c       | 3 +++
>   include/linux/device.h  | 1 +
>   3 files changed, 11 insertions(+)
> 
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 8640578f45e9..d28cb91d5cf9 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
>   				ptr->irq[i], ret);
>   			goto err_free_desc;
>   		}
> +
> +		/*
> +		 * mark the device as irq affinity managed if any irq affinity
> +		 * descriptor is managed
> +		 */
> +		if (desc[i].is_managed)
> +			dev->dev.irq_affinity_managed = true;
>   	}
>   
>   	devres_add(&dev->dev, ptr);
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 3d6db20d1b2b..7ddec90b711d 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>   	if (flags & PCI_IRQ_AFFINITY) {
>   		if (!affd)
>   			affd = &msi_default_affd;
> +		dev->dev.irq_affinity_managed = true;
>   	} else {
>   		if (WARN_ON(affd))
>   			affd = NULL;
> @@ -1215,6 +1216,8 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>   			return nvecs;
>   	}
>   
> +	dev->dev.irq_affinity_managed = false;
> +
>   	/* use legacy IRQ if allowed */
>   	if (flags & PCI_IRQ_LEGACY) {
>   		if (min_vecs == 1 && dev->irq) {
> diff --git a/include/linux/device.h b/include/linux/device.h
> index 59940f1744c1..9ec6e671279e 100644
> --- a/include/linux/device.h
> +++ b/include/linux/device.h
> @@ -569,6 +569,7 @@ struct device {
>   #ifdef CONFIG_DMA_OPS_BYPASS
>   	bool			dma_ops_bypass : 1;
>   #endif
> +	bool			irq_affinity_managed : 1;
>   };
>   
>   /**
> 



* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-19  7:51   ` John Garry
@ 2021-07-19  9:44     ` Christoph Hellwig
  2021-07-19 10:39       ` John Garry
  2021-07-21  7:20       ` Thomas Gleixner
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-19  9:44 UTC (permalink / raw)
  To: John Garry
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Thomas Gleixner,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>> Address this issue by adding one field of .irq_affinity_managed into
>> 'struct device'.
>>
>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Did you consider that for PCI device we effectively have this info already:
>
> bool dev_has_managed_msi_irq(struct device *dev)
> {
> 	struct msi_desc *desc;
>
> 	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
> 		if (desc->affinity && desc->affinity->is_managed)
> 			return true;
> 	}
>
> 	return false;

Just walking the list seems fine to me given that this is not a
performance critical path.  But what are the locking implications?

Also does the above imply this won't work for your platform MSI case?


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-19  9:44     ` Christoph Hellwig
@ 2021-07-19 10:39       ` John Garry
  2021-07-20  2:38         ` Ming Lei
  2021-07-21  7:20       ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: John Garry @ 2021-07-19 10:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Thomas Gleixner,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On 19/07/2021 10:44, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>>> Address this issue by adding one field of .irq_affinity_managed into
>>> 'struct device'.
>>>
>>> Suggested-by: Christoph Hellwig <hch@lst.de>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>
>> Did you consider that for PCI device we effectively have this info already:
>>
>> bool dev_has_managed_msi_irq(struct device *dev)
>> {
>> 	struct msi_desc *desc;
>>
>> 	list_for_each_entry(desc, dev_to_msi_list(dev), list)

I just noticed for_each_msi_entry(), which is the same.


>> 		if (desc->affinity && desc->affinity->is_managed)
>> 			return true;
>> 	}
>>
>> 	return false;
> 
> Just walking the list seems fine to me given that this is not a
> performance critical path.  But what are the locking implications?

Since it would be used for sequential setup code, I didn't think any
locking was required. But we would need to consider where that function
lived and whether it's public.

> 
> Also does the above imply this won't work for your platform MSI case?
> .
> 

Right. I think that it may be possible to reach into the platform msi 
descriptors to get this info, but I am not sure it's worth it. There is 
only 1x user there and there is no generic .map_queues function, so we
could set the flag directly:

int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
  		for_each_cpu(cpu, mask)
  			qmap->mq_map[cpu] = qmap->queue_offset + queue;
  	}
+	qmap->use_managed_irq = dev_has_managed_msi_irq(&pdev->dev);
}

--- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
@@ -3563,6 +3563,8 @@ static int map_queues_v2_hw(struct Scsi_Host *shost)
                        qmap->mq_map[cpu] = qmap->queue_offset + queue;
        }

+       qmap->use_managed_irq = 1;
+
        return 0;
}


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-19 10:39       ` John Garry
@ 2021-07-20  2:38         ` Ming Lei
  0 siblings, 0 replies; 22+ messages in thread
From: Ming Lei @ 2021-07-20  2:38 UTC (permalink / raw)
  To: John Garry
  Cc: Christoph Hellwig, Jens Axboe, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Thomas Gleixner,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On Mon, Jul 19, 2021 at 11:39:53AM +0100, John Garry wrote:
> On 19/07/2021 10:44, Christoph Hellwig wrote:
> > On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
> > > > Address this issue by adding one field of .irq_affinity_managed into
> > > > 'struct device'.
> > > > 
> > > > Suggested-by: Christoph Hellwig <hch@lst.de>
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > 
> > > Did you consider that for PCI device we effectively have this info already:
> > > 
> > > bool dev_has_managed_msi_irq(struct device *dev)
> > > {
> > > 	struct msi_desc *desc;
> > > 
> > > 	list_for_each_entry(desc, dev_to_msi_list(dev), list)
> 
> I just noticed for_each_msi_entry(), which is the same.
> 
> 
> > > 		if (desc->affinity && desc->affinity->is_managed)
> > > 			return true;
> > > 	}
> > > 
> > > 	return false;
> > 
> > Just walking the list seems fine to me given that this is not a
> > performance critical path.  But what are the locking implications?
> 
> Since it would be used for sequential setup code, I didn't think any locking
> was required. But we would need to consider where that function lived and
> whether it's public.

Yeah, the allocated irq vectors should be live when running map queues.

> 
> > 
> > Also does the above imply this won't work for your platform MSI case?
> > .
> > 
> 
> Right. I think that it may be possible to reach into the platform msi
> descriptors to get this info, but I am not sure it's worth it. There is only
> 1x user there and there is no generic .map_queues function, so we could set
> the flag directly:
> 
> int blk_mq_pci_map_queues(struct blk_mq_queue_map *qmap, struct pci_dev *pdev,
>  		for_each_cpu(cpu, mask)
>  			qmap->mq_map[cpu] = qmap->queue_offset + queue;
>  	}
> +	qmap->use_managed_irq = dev_has_managed_msi_irq(&pdev->dev);
> }
> 
> --- a/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
> +++ b/drivers/scsi/hisi_sas/hisi_sas_v2_hw.c
> @@ -3563,6 +3563,8 @@ static int map_queues_v2_hw(struct Scsi_Host *shost)
>                        qmap->mq_map[cpu] = qmap->queue_offset + queue;
>        }
> 
> +       qmap->use_managed_irq = 1;
> +
>        return 0;

virtio can be populated via a platform device too, but managed irq
affinity isn't used there, so it seems dev_has_managed_msi_irq() is fine.


Thanks,
Ming



* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-17  9:30     ` Ming Lei
@ 2021-07-21  0:30       ` Bjorn Helgaas
  0 siblings, 0 replies; 22+ messages in thread
From: Bjorn Helgaas @ 2021-07-21  0:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Thomas Gleixner,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry,
	Hannes Reinecke, Keith Busch

On Sat, Jul 17, 2021 at 05:30:43PM +0800, Ming Lei wrote:
> On Fri, Jul 16, 2021 at 03:01:54PM -0500, Bjorn Helgaas wrote:
> > On Thu, Jul 15, 2021 at 08:08:42PM +0800, Ming Lei wrote:
> > > irq vector allocation with managed affinity may be used by driver, and
> > > blk-mq needs this info because managed irq will be shutdown when all
> > > CPUs in the affinity mask are offline.
> > > 
> > > The info of using managed irq is often produced by drivers(pci subsystem,
> > 
> > Add space between "drivers" and "(".
> > s/pci/PCI/
> 
> OK.
> 
> > Does this "managed IRQ" (or "managed affinity", not sure what the
> > correct terminology is here) have something to do with devm?
> > 
> > > platform device, ...), and it is consumed by blk-mq, so different subsystems
> > > are involved in this info flow
> > 
> > Add period at end of sentence.
> 
> OK.
> 
> > > Address this issue by adding one field of .irq_affinity_managed into
> > > 'struct device'.
> > > 
> > > Suggested-by: Christoph Hellwig <hch@lst.de>
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > >  drivers/base/platform.c | 7 +++++++
> > >  drivers/pci/msi.c       | 3 +++
> > >  include/linux/device.h  | 1 +
> > >  3 files changed, 11 insertions(+)
> > > 
> > > diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> > > index 8640578f45e9..d28cb91d5cf9 100644
> > > --- a/drivers/base/platform.c
> > > +++ b/drivers/base/platform.c
> > > @@ -388,6 +388,13 @@ int devm_platform_get_irqs_affinity(struct platform_device *dev,
> > >  				ptr->irq[i], ret);
> > >  			goto err_free_desc;
> > >  		}
> > > +
> > > +		/*
> > > +		 * mark the device as irq affinity managed if any irq affinity
> > > +		 * descriptor is managed
> > > +		 */
> > > +		if (desc[i].is_managed)
> > > +			dev->dev.irq_affinity_managed = true;
> > >  	}
> > >  
> > >  	devres_add(&dev->dev, ptr);
> > > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> > > index 3d6db20d1b2b..7ddec90b711d 100644
> > > --- a/drivers/pci/msi.c
> > > +++ b/drivers/pci/msi.c
> > > @@ -1197,6 +1197,7 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
> > >  	if (flags & PCI_IRQ_AFFINITY) {
> > >  		if (!affd)
> > >  			affd = &msi_default_affd;
> > > +		dev->dev.irq_affinity_managed = true;
> > 
> > This is really opaque to me.  I can't tell what the connection between
> > PCI_IRQ_AFFINITY and irq_affinity_managed is.
> 
> The comment for PCI_IRQ_AFFINITY is 'Auto-assign affinity', and
> 'irq_affinity_managed' basically means that the irq's affinity is
> managed by the kernel.
> 
> What blk-mq needs is exactly whether PCI_IRQ_AFFINITY was applied when
> allocating irq vectors. When PCI_IRQ_AFFINITY is used, genirq will
> shut down the irq when all CPUs in the assigned affinity are offline,
> and then blk-mq has to drain all in-flight IOs which will be completed
> via this irq and prevent new IO. That is the connection.
> 
> Or do you think 'irq_affinity_managed' isn't named well?

I've been looking at "devm" management where there is a concept of
devm-managed IRQs, and you mentioned "managed IRQ" in the commit log.
But IIUC this is a completely different sort of management.

> > AFAICT the only place irq_affinity_managed is ultimately used is
> > blk_mq_hctx_notify_offline(), and there's no obvious connection
> > between that and this code.
> 
> I believe the connection is described in the comment.

You mean the comment that says "hctx needn't to be deactivated in case
managed irq isn't used"?  Sorry, that really doesn't explain to me why
pci_alloc_irq_vectors_affinity() needs to set irq_affinity_managed.
There's just a lot of magic here that cannot be deduced by reading the
code.

Nit: s/needn't to be/needn't be/ in that comment.  Or maybe even
"Deactivate hctx only when managed IRQ is used"?  I still have no idea
what that means, but at least it's easier to read :)  Or maybe this is
actually "draining a queue" instead of "deactivating"?

Bjorn


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-19  9:44     ` Christoph Hellwig
  2021-07-19 10:39       ` John Garry
@ 2021-07-21  7:20       ` Thomas Gleixner
  2021-07-21  7:24         ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2021-07-21  7:20 UTC (permalink / raw)
  To: Christoph Hellwig, John Garry
  Cc: Ming Lei, Jens Axboe, Christoph Hellwig, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Sagi Grimberg,
	Daniel Wagner, Wen Xiong, Hannes Reinecke, Keith Busch

On Mon, Jul 19 2021 at 11:44, Christoph Hellwig wrote:
> On Mon, Jul 19, 2021 at 08:51:22AM +0100, John Garry wrote:
>>> Address this issue by adding one field of .irq_affinity_managed into
>>> 'struct device'.
>>>
>>> Suggested-by: Christoph Hellwig <hch@lst.de>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>
>> Did you consider that for PCI device we effectively have this info already:
>>
>> bool dev_has_managed_msi_irq(struct device *dev)
>> {
>> 	struct msi_desc *desc;
>>
>> 	list_for_each_entry(desc, dev_to_msi_list(dev), list) {
>> 		if (desc->affinity && desc->affinity->is_managed)
>> 			return true;
>> 	}
>>
>> 	return false;
>
> Just walking the list seems fine to me given that this is not a
> performance critical path.  But what are the locking implications?

At the moment there are none because the list is initialized in the
setup path and never modified afterwards. Though that might change
sooner than later to fix the virtio wreckage vs. MSI-X.

> Also does the above imply this won't work for your platform MSI case?

The msi descriptors are attached to struct device and independent of
platform/PCI/whatever.

Thanks,

        tglx


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21  7:20       ` Thomas Gleixner
@ 2021-07-21  7:24         ` Christoph Hellwig
  2021-07-21  9:44           ` John Garry
  2021-07-21 20:14           ` Thomas Gleixner
  0 siblings, 2 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-21  7:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, John Garry, Ming Lei, Jens Axboe, linux-block,
	linux-nvme, Greg Kroah-Hartman, Bjorn Helgaas, linux-pci,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
> > Just walking the list seems fine to me given that this is not a
> > performance critical path.  But what are the locking implications?
> 
> At the moment there are none because the list is initialized in the
> setup path and never modified afterwards. Though that might change
> sooner than later to fix the virtio wreckage vs. MSI-X.

What is the issue there?  Either way, if we keep the helper in the
IRQ code it should be easy to spot for anyone adding the locking.

> > Also does the above imply this won't work for your platform MSI case?
> 
> The msi descriptors are attached to struct device and independent of
> platform/PCI/whatever.

That's what I assumed, but this text from John suggested there is
something odd about the platform case:

"Did you consider that for PCI .."


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21  7:24         ` Christoph Hellwig
@ 2021-07-21  9:44           ` John Garry
  2021-07-21 20:22             ` Thomas Gleixner
  2021-07-21 20:14           ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: John Garry @ 2021-07-21  9:44 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner
  Cc: Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Sagi Grimberg,
	Daniel Wagner, Wen Xiong, Hannes Reinecke, Keith Busch,
	Marc Zyngier

+ Marc

On 21/07/2021 08:24, Christoph Hellwig wrote:
> On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
>>> Just walking the list seems fine to me given that this is not a
>>> performance critical path.  But what are the locking implications?
>> At the moment there are none because the list is initialized in the
>> setup path and never modified afterwards. Though that might change
>> sooner than later to fix the virtio wreckage vs. MSI-X.
> What is the issue there?  Either way, if we keep the helper in the
> IRQ code it should be easy to spot for anyone adding the locking.
> 
>>> Also does the above imply this won't work for your platform MSI case?
>> The msi descriptors are attached to struct device and independent of
>> platform/PCI/whatever.
> That's what I assumed, but this text from John suggested there is
> something odd about the platform case:
> 
> "Did you consider that for PCI .."
> .

For this special platform MSI case there is a secondary interrupt
controller (called mbigen) which generates the MSI on behalf of the
device; I think the MSI belongs to the mbigen (and not to the device
itself).

See "latter case" mentioned in commit 91f90daa4fb2.

I think Marc and Thomas can explain this much better than I could.

Anyway, as I mentioned earlier, I think that this specific problem is 
unique and can be solved without using a function which examines the 
struct device.msi_list .

Thanks,
John


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21  7:24         ` Christoph Hellwig
  2021-07-21  9:44           ` John Garry
@ 2021-07-21 20:14           ` Thomas Gleixner
  2021-07-21 20:32             ` Christoph Hellwig
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2021-07-21 20:14 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Hellwig, John Garry, Ming Lei, Jens Axboe, linux-block,
	linux-nvme, Greg Kroah-Hartman, Bjorn Helgaas, linux-pci,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On Wed, Jul 21 2021 at 09:24, Christoph Hellwig wrote:

> On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
>> > Just walking the list seems fine to me given that this is not a
>> > performance critical path.  But what are the locking implications?
>> 
>> At the moment there are none because the list is initialized in the
>> setup path and never modified afterwards. Though that might change
>> sooner than later to fix the virtio wreckage vs. MSI-X.
>
> What is the issue there?  Either way, if we keep the helper in the
> IRQ code it should be easy to spot for anyone adding the locking.

  https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de

TLDR: virtio allocates ONE irq on msix_enable() and then when the guest
actually unmasks another entry (e.g. request_irq()), it tears down the
allocated one and sets up two. On the third one this repeats ....

There are only two options:

  1) allocate everything upfront, which is undesired
  2) append entries, which might need locking, but I'm still trying to
     avoid that

There is another problem vs. vector exhaustion which can't be fixed that
way, but that's a different story.

Thanks,

        tglx


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21  9:44           ` John Garry
@ 2021-07-21 20:22             ` Thomas Gleixner
  2021-07-22  7:48               ` John Garry
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2021-07-21 20:22 UTC (permalink / raw)
  To: John Garry, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Sagi Grimberg,
	Daniel Wagner, Wen Xiong, Hannes Reinecke, Keith Busch,
	Marc Zyngier

On Wed, Jul 21 2021 at 10:44, John Garry wrote:
> On 21/07/2021 08:24, Christoph Hellwig wrote:
>> On Wed, Jul 21, 2021 at 09:20:00AM +0200, Thomas Gleixner wrote:
>>>> Also does the above imply this won't work for your platform MSI case?
>>> The msi descriptors are attached to struct device and independent of
>>> platform/PCI/whatever.
>> That's what I assumed, but this text from John suggested there is
>> something odd about the platform case:
>> 
>> "Did you consider that for PCI .."
>> .
>
> For this special platform MSI case there is a secondary interrupt
> controller (called mbigen) which generates the MSI on behalf of the
> device; I think the MSI belongs to the mbigen (and not to the device
> itself).

MBIGEN is a different story because it converts wired interrupts into
MSI interrupts, IOW an MSI-based interrupt pin extender.

I might be wrong, but I seriously doubt that any multiqueue device which
wants to use affinity managed interrupts is built on top of that.

Thanks,

        tglx


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21 20:14           ` Thomas Gleixner
@ 2021-07-21 20:32             ` Christoph Hellwig
  2021-07-21 22:38               ` Thomas Gleixner
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-21 20:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, John Garry, Ming Lei, Jens Axboe, linux-block,
	linux-nvme, Greg Kroah-Hartman, Bjorn Helgaas, linux-pci,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On Wed, Jul 21, 2021 at 10:14:25PM +0200, Thomas Gleixner wrote:
>   https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de
> 
> TLDR: virtio allocates ONE irq on msix_enable() and then when the guest
> actually unmasks another entry (e.g. request_irq()), it tears down the
> allocated one and sets up two. On the third one this repeats ....
> 
> There are only two options:
> 
>   1) allocate everything upfront, which is undesired
>   2) append entries, which might need locking, but I'm still trying to
>      avoid that
> 
> There is another problem vs. vector exhaustion which can't be fixed that
> way, but that's a different story.

FYI, NVMe is similar.  We need one IRQ to set up the admin queue,
which is used to query/set how many I/O queues are supported.  Just
two steps though and not unbound.
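
For reference, a condensed sketch of those two steps (simplified from
the nvme-pci probe path, not the literal code):

/* Step 1: a single vector is enough to drive the admin queue. */
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);

/* ... query/set the supported I/O queue count via the admin queue ... */

/* Step 2: tear everything down and reallocate with managed affinity,
 * keeping one unmanaged pre-vector for the admin queue. */
pci_free_irq_vectors(pdev);
ret = pci_alloc_irq_vectors_affinity(pdev, 2, nr_io_queues + 1,
		PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);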


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21 20:32             ` Christoph Hellwig
@ 2021-07-21 22:38               ` Thomas Gleixner
  2021-07-22  7:46                 ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2021-07-21 22:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Hellwig, John Garry, Ming Lei, Jens Axboe, linux-block,
	linux-nvme, Greg Kroah-Hartman, Bjorn Helgaas, linux-pci,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On Wed, Jul 21 2021 at 22:32, Christoph Hellwig wrote:
> On Wed, Jul 21, 2021 at 10:14:25PM +0200, Thomas Gleixner wrote:
>>   https://lore.kernel.org/r/87o8bxcuxv.ffs@nanos.tec.linutronix.de
>> 
>> TLDR: virtio allocates ONE irq on msix_enable() and then when the
>> guest

Oops, sorry, that should have been VFIO, not virtio.

>> actually unmasks another entry (e.g. request_irq()), it tears down the
>> allocated one and sets up two. On the third one this repeats ....
>> 
>> There are only two options:
>> 
>>   1) allocate everything upfront, which is undesired
>>   2) append entries, which might need locking, but I'm still trying to
>>      avoid that
>> 
>> There is another problem vs. vector exhaustion which can't be fixed that
>> way, but that's a different story.
>
> FYI, NVMe is similar.  We need one IRQ to set up the admin queue,
> which is used to query/set how many I/O queues are supported.  Just
> two steps though and not unbound.

That's fine because that's controlled by the driver consistently and it
(hopefully) makes sure that the admin queue is quiesced before
everything is torn down after the initial query.

But that's not the case for VFIO. It tears down all in-use interrupts
and the guest driver is completely oblivious of that.

Assume the following situation:

 1) VM boots with 8 present CPUs and 16 possible CPUs

 2) The passed through card (PF or VF) supports multiqueue and the
    driver uses managed interrupts which e.g. allocates one queue and
    one interrupt per possible CPU.

    Initial setup requests all the interrupts, but only the first 8
    queue interrupts are unmasked and therefore reallocated by the host
    which works by some definition of works because the device is quiet
    at that point.

 3) Host admin plugs the other 8 CPUs into the guest

    Onlining these CPUs in the guest will unmask the dormant managed
    queue interrupts and cause the host to allocate the remaining 8 per
    queue interrupts one by one thereby tearing down _all_ previously
    allocated ones and then allocating one more than before.

    Assume that while this goes on the guest has I/O running on the
    already online CPUs and their associated queues. Depending on the
    device this either will lose interrupts or reroute them to the
    legacy INTx which is not handled. This might in the best case result
    in a few "timedout" requests, but I managed it at least once to make
    the device go into lala land state, i.e. it did not recover.

The above can be fixed by adding an 'append' mode to the MSI code.

But that does not fix the overcommit issue where the host runs out of
vector space. The result is simply that the guest does not know and just
continues to work on device/queues which will never ever receive an
interrupt (again).

I got educated that all of this is considered unlikely and my argument
that the concept of unlikely simply does not exist at cloud scale got
ignored. Sure, I know it's VIRT and therefore not subject to common
sense.

Thanks,

        tglx

* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21 22:38               ` Thomas Gleixner
@ 2021-07-22  7:46                 ` Christoph Hellwig
  0 siblings, 0 replies; 22+ messages in thread
From: Christoph Hellwig @ 2021-07-22  7:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christoph Hellwig, John Garry, Ming Lei, Jens Axboe, linux-block,
	linux-nvme, Greg Kroah-Hartman, Bjorn Helgaas, linux-pci,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, Hannes Reinecke,
	Keith Busch

On Thu, Jul 22, 2021 at 12:38:07AM +0200, Thomas Gleixner wrote:
> That's fine because that's controlled by the driver consistently and it
> (hopefully) makes sure that the admin queue is quiesced before
> everything is torn down after the initial query.

Yes, it is.

> The above can be fixed by adding an 'append' mode to the MSI code.

So IFF we get that append mode I think it would help to simplify
drivers that have unmanaged pre and post vectors, and/or do the above
probing.

So instead of the current scheme of requesting a single unmanaged
vector, doing basic setup, tearing it down, and then requesting N
managed vectors with an unmanaged pre-vector, we could just keep the
unmanaged vector, never tear it down and just append the post vectors.
In the long run this could remove the need to do the pre and post
vectors entirely.
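
A sketch of that idea (pci_append_irq_vectors_affinity() is purely
hypothetical here, no such helper exists today):

/* Keep the unmanaged admin vector allocated at probe time ... */
ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);

/* ... and once the I/O queue count is known, append the managed queue
 * vectors without tearing the admin vector down (hypothetical API). */
ret = pci_append_irq_vectors_affinity(pdev, nr_io_queues, &affd);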


* Re: [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed
  2021-07-21 20:22             ` Thomas Gleixner
@ 2021-07-22  7:48               ` John Garry
  0 siblings, 0 replies; 22+ messages in thread
From: John Garry @ 2021-07-22  7:48 UTC (permalink / raw)
  To: Thomas Gleixner, Christoph Hellwig
  Cc: Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Greg Kroah-Hartman, Bjorn Helgaas, linux-pci, Sagi Grimberg,
	Daniel Wagner, Wen Xiong, Hannes Reinecke, Keith Busch,
	Marc Zyngier

On 21/07/2021 21:22, Thomas Gleixner wrote:
>>> That's what I assumed, but this text from John suggested there is
>>> something odd about the platform case:
>>>
>>> "Did you consider that for PCI .."
>>> .
>> For this special platform MSI case there is a secondary interrupt
>> controller (called mbigen) which generates the MSI on behalf of the
>> device, which I think the MSI belongs to (and not the device, itself).
> MBIGEN is a different story because it converts wired interrupts into
> MSI interrupts, IOW an MSI-based interrupt pin extender.
> 
> I might be wrong, but I seriously doubt that any multiqueue device which
> wants to use affinity managed interrupts is built on top of that.

Actually, the SCSI device for which platform device managed-interrupt
support was added in commit e15f2fa959f2 uses mbigen.

Thanks,
John


end of thread

Thread overview: 22 messages
2021-07-15 12:08 [PATCH V4 0/3] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
2021-07-15 12:08 ` [PATCH V4 1/3] driver core: mark device as irq affinity managed if any irq is managed Ming Lei
2021-07-15 12:40   ` Greg Kroah-Hartman
2021-07-16  2:17     ` Ming Lei
2021-07-16 20:01   ` Bjorn Helgaas
2021-07-17  9:30     ` Ming Lei
2021-07-21  0:30       ` Bjorn Helgaas
2021-07-19  7:51   ` John Garry
2021-07-19  9:44     ` Christoph Hellwig
2021-07-19 10:39       ` John Garry
2021-07-20  2:38         ` Ming Lei
2021-07-21  7:20       ` Thomas Gleixner
2021-07-21  7:24         ` Christoph Hellwig
2021-07-21  9:44           ` John Garry
2021-07-21 20:22             ` Thomas Gleixner
2021-07-22  7:48               ` John Garry
2021-07-21 20:14           ` Thomas Gleixner
2021-07-21 20:32             ` Christoph Hellwig
2021-07-21 22:38               ` Thomas Gleixner
2021-07-22  7:46                 ` Christoph Hellwig
2021-07-15 12:08 ` [PATCH V4 2/3] blk-mq: mark if one queue map uses managed irq Ming Lei
2021-07-15 12:08 ` [PATCH V4 3/3] blk-mq: don't deactivate hctx if managed irq isn't used Ming Lei
