Re: PCI MSI issue with reinserting a driver

From: Thomas Gleixner <tglx@linutronix.de>
To: Marc Zyngier <maz@kernel.org>, John Garry <john.garry@huawei.com>
Cc: Zhou Wang <wangzhou1@hisilicon.com>, linux-kernel@vger.kernel.org
Subject: Re: PCI MSI issue with reinserting a driver
Date: Wed, 30 Aug 2023 01:00:38 +0200	[thread overview]
Message-ID: <87350178tl.ffs@tglx> (raw)
In-Reply-To: <3d3d0155e66429968cb4f6b4feeae4b3@kernel.org>

On Wed, Feb 03 2021 at 17:23, Marc Zyngier wrote:
> On 2021-02-02 15:46, John Garry wrote:
>> On 02/02/2021 14:48, Marc Zyngier wrote:
>>> We can't really do that. All the events must be contiguous,
>>> and there is also a lot of assumptions in the ITS driver that
>>> LPI allocations is also contiguous.
>>> 
>>> But there is also the fact that for Multi-MSI, we *must*
>>> allocate 32 vectors. Any driver could assume that if we have
>>> allocated 17 vectors, then there is another 15 available.

In theory. In practice no driver can assume that simply because there is
no corresponding interrupt descriptor. The PCI/MSI code allocates $N
vectors, the MSI code allocates exactly $N interrupt descriptors, so
there is no way that the driver can make up N = round_up_power_of_to($N)
magically.

The only reason why this makes sense is when its_dev and the associated
bitmap is shared between devices because that way you ensure that above
the non-allocated interrupts there is a gap up to the next power of two
to catch malfunctioning hardware/drivers.

If its_dev is not shared, then this makes no difference as the gap is
there naturaly.

> I'm not overly keen on handling this in the ITS though, and I'd rather
> we try and do it in the generic code. How about this (compile tested
> only)?
...
> @@ -1399,8 +1399,19 @@ static void irq_domain_free_irqs_hierarchy(struct 
> irq_domain *domain,
>   		return;
>
>   	for (i = 0; i < nr_irqs; i++) {
> -		if (irq_domain_get_irq_data(domain, irq_base + i))
> -			domain->ops->free(domain, irq_base + i, 1);
> +		int n ;
> +
> +		/* Find the largest possible span of IRQs to free in one go */
> +		for (n = 0;
> +		     ((i + n) < nr_irqs &&
> +		      irq_domain_get_irq_data(domain, irq_base + i + n));
> +		     n++);
> +
> +		if (!n)
> +			continue;
> +
> +		domain->ops->free(domain, irq_base + i, n);
> +		i += n;

TBH. That makes my eyes bleed and it's just an ugly workaround, which
works by chance for that ITS implementation detail. That's really not
the place to do that.

I've stared at this before when I was doing the initial PoC to convert
GIC over to the per device MSI domain model and did not come up with a
solution to implement support for dynamic PCI-MSI allocation.

I know that you need a fix for the current code, but we really should
move all this muck over to the per device model which makes this way
simpler as you really have per device mechanisms.

The underlying problem is the whole bulk allocation/free assumption in
the ITS driver, which you need to address anyway when you ever want to
support dynamic PCI/MSI-X and PCI/IMS.

For that this really needs to be properly split up:

   1) Initialization: Reserve a LPI bitmap region for a sufficiently
      large number of MSI interrupts which handles the power of 2
      requirement

   2) Alloc: Allocate the individual interrupts within the bitmap
      region

   3) Free: Free the individual interrupts and just clear the
      corresponding bit within the bitmap region

   4) Teardown: Release the bitmap region

But lets look at that later.

For the problem at hand I rather see something like the below instead of
this hideous 'try to find a range' hackery which is incomprehensible
without a comment the size of a planet.

That's not pretty either, but at least it makes it entirely clear that
the underlying irqdomain requires this for whatever insane reason.

Thanks,

        tglx
---
 include/linux/irqdomain.h |    8 ++++++++
 kernel/irq/irqdomain.c    |    9 +++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -208,6 +208,9 @@ enum {
 	/* Irq domain is a MSI device domain */
 	IRQ_DOMAIN_FLAG_MSI_DEVICE	= (1 << 9),
 
+	/* Irq domain requires bulk freeing of interrupts */
+	IRQ_DOMAIN_FREE_BULK		= (1 << 10),
+
 	/*
 	 * Flags starting from IRQ_DOMAIN_FLAG_NONCORE are reserved
 	 * for implementation specific purposes and ignored by the
@@ -572,6 +575,11 @@ static inline bool irq_domain_is_msi_dev
 	return domain->flags & IRQ_DOMAIN_FLAG_MSI_DEVICE;
 }
 
+static inline bool irq_domain_free_bulk(struct irq_domain *domain)
+{
+	return domain->flags & IRQ_DOMAIN_FREE_BULK;
+}
+
 #else	/* CONFIG_IRQ_DOMAIN_HIERARCHY */
 static inline int irq_domain_alloc_irqs(struct irq_domain *domain,
 			unsigned int nr_irqs, int node, void *arg)
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1442,14 +1442,19 @@ static void irq_domain_free_irqs_hierarc
 					   unsigned int irq_base,
 					   unsigned int nr_irqs)
 {
-	unsigned int i;
+	unsigned int i, tofree = 1;
 
 	if (!domain->ops->free)
 		return;
 
+	if (irq_domain_free_bulk(domain)) {
+		tofree = nr_irqs;
+		nr_irqs = 1;
+	}
+
 	for (i = 0; i < nr_irqs; i++) {
 		if (irq_domain_get_irq_data(domain, irq_base + i))
-			domain->ops->free(domain, irq_base + i, 1);
+			domain->ops->free(domain, irq_base + i, tofree);
 	}
 }