Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

From: David Woodhouse <dwmw2@infradead.org>
To: Thomas Gleixner <tglx@linutronix.de>, x86@kernel.org
Cc: iommu <iommu@lists.linux-foundation.org>,
	kvm <kvm@vger.kernel.org>,
	linux-hyperv@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
Date: Wed, 07 Oct 2020 15:10:24 +0100	[thread overview]
Message-ID: <336029ca32524147a61b6fa1eb734debc9d51a00.camel@infradead.org> (raw)
In-Reply-To: <87tuv640nw.fsf@nanos.tec.linutronix.de>

[-- Attachment #1: Type: text/plain, Size: 5425 bytes --]

On Wed, 2020-10-07 at 15:37 +0200, Thomas Gleixner wrote:
> On Wed, Oct 07 2020 at 08:19, David Woodhouse wrote:
> > On Tue, 2020-10-06 at 23:26 +0200, Thomas Gleixner wrote:
> > > On Mon, Oct 05 2020 at 16:28, David Woodhouse wrote:
> > > > From: David Woodhouse <dwmw@amazon.co.uk>
> > > > 
> > > > This is the maximum possible set of CPUs which can be used. Use it
> > > > to calculate the default affinity requested from __irq_alloc_descs()
> > > > by first attempting to find the intersection with irq_default_affinity,
> > > > or falling back to using just the max_affinity if the intersection
> > > > would be empty.
> > > 
> > > And why do we need that as yet another argument?
> > > 
> > > This is an optional property of the irq domain, really and no caller has
> > > any business with that. 
> > 
> > Because irq_domain_alloc_descs() doesn't actually *take* the domain as
> > an argument. It's more of an internal function, which is only non-
> > static because it's used from kernel/irq/ipi.c too for some reason. If
> > we convert the IPI code to just call __irq_alloc_descs() directly,
> > perhaps that we can actually make irq_domain_alloc_decs() static.
> 
> What is preventing you to change the function signature? But handing
> down irqdomain here is not cutting it. The right thing to do is to
> replace 'struct irq_affinity_desc *affinity' with something more
> flexible.

Yeah, although I do think I want to ditch this part completely, and
treat the "possible" mask for the domain very much more like we do
cpu_online_mask. In that we can allow affinities to be *requested*
which are outside it, and they can just never be *effective* while
those CPUs aren't present and reachable.

That way a lot of the nastiness about preparing an "acceptable" mask in
advance can go away.

It's *also* driven, as I noted, by the realisation that on x86, the
x86_non_ir_cpumask will only ever contain those CPUs which have been
brought online so far and meet the criteria... but won't that be true
on *any* system where CPU numbers are virtual and not 1:1 mapped with
whatever determines reachability by the IRQ domain? It isn't *such* an
x86ism, surely?

> Fact is, that if there are CPUs which cannot be targeted by device
> interrupts then the multiqueue affinity mechanism has to be fixed to
> handle this. Right now it's just broken.

I think part of the problem there is that I don't really understand how
this part is *supposed* to work. I was focusing on getting the simple
case working first, in the expectation that we'd come back to that part
ansd you'd keep me honest. Is there some decent documentation on this
that I'm missing?

> It's pretty obvious that the irq domains are the right place to store
> that information:
> 
> const struct cpumask *irqdomain_get_possible_affinity(struct irq_domain *d)
> {
>         while (d) {
>         	if (d->get_possible_affinity)
>                 	return d->get_possible_affinity(d);
>                 d = d->parent;
>         }
>         return cpu_possible_mask;
> }

Sure.

> So if you look at X86 then you have either:
> 
>    [VECTOR] ----------------- [IO/APIC]
>                           |-- [MSI]
>                           |-- [WHATEVER]
> 
> or
> 
>    [VECTOR] ---[REMAP]------- [IO/APIC]
>              |            |-- [MSI]
>              |----------------[WHATEVER]

Hierarchically, theoretically the IOAPIC and HPET really ought to be
children of the MSI domain. It's the Compatibility MSI which has the
restriction on destination ID, because of how many bits it interprets
from the MSI address. HPET and IOAPIC are just generating MSIs that hit
that upstream limit.

> So if REMAP allows cpu_possible_mask and VECTOR some restricted subset
> then irqdomain_get_possible_affinity() will return the correct result
> independent whether remapping is enabled or not.

Sure. Although VECTOR doesn't have the restriction. As noted, it's the
Compatibility MSI that does. So the diagram looks something like this:

 [ VECTOR ] ---- [ REMAP ] ---- [ IR-MSI ] ---- [ IR-HPET ]
              |                             |---[ IR-PCI-MSI ]
              |                             |---[ IR-IOAPIC ]
              |
              |--[ COMPAT-MSI ] ---- [ HPET ]
                                 |-- [ PCI-MSI ]
                                 |-- [ IOAPIC ]

In this diagram, it's COMPAT-MSI that has the affinity restriction,
inherited by its children.

Now, I understand that you're not keen on IOAPIC actually being a child
of the MSI domain, and that's fine. In Linux right now, those generic
'IR-MSI' and 'COMPAT-MSI' domains don't exist. So all three of the
compatibility HPET, PCI-MSI and IOAPIC domains would have to add that
same 8-bit affinity limit for themselves, as their parent is the VECTOR
domain.

I suppose it *might* not hurt to pretend that VECTOR does indeed have
the limit, and to *remove* it in the remapping domain. And then the
affinity limit could be removed in one place by the REMAP domain
because even now in Linux's imprecise hierarchy, the IR-HPET, IR-PCI-
MSI and IR-IOAPIC domains *are* children of that.

But honestly, I'd rather iterate and get the generic irqdomain support
for affinity limits into decent shape before I worry *too* much about
precisely which was round it gets used by x86.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5174 bytes --]