Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR

From: David Woodhouse <dwmw2@infradead.org>
To: Thomas Gleixner <tglx@linutronix.de>, x86@kernel.org
Cc: iommu <iommu@lists.linux-foundation.org>,
	kvm <kvm@vger.kernel.org>,
	linux-hyperv@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR
Date: Wed, 07 Oct 2020 16:46:10 +0100
Message-ID: <2f09a1f97d97e638e90c6eca3ebeebb4be852f58.camel@infradead.org>
In-Reply-To: <87d01u3vo6.fsf@nanos.tec.linutronix.de>

On Wed, 2020-10-07 at 17:25 +0200, Thomas Gleixner wrote:
> It's clearly how the hardware works. MSI has a message store of some
> sorts and if the entry is enabled then the MSI chip (in PCI or
> elsewhere) will send exactly the message which is in that message
> store. It knows absolutely nothing about what the message means and how
> it is composed. The only things which MSI knows about is whether the
> message address is 64bit wide or not and whether the entries are
> maskable or not and how many entries it can store.
> 
> Software allocates a message target at the underlying irq domain (vector
> or remap) and that underlying irq domain defines the properties.
> 
> If qemu emulates it differently then it's qemu's problem, but that does
> not make it in anyway something which influences the irq domain
> abstractions which are correctly modeled after how the hardware works.
> 
> > Not really the important part to deal with right now, either way.
> 
> Oh yes it is. We define that upfront and not after the fact.

The way the hardware works is that something handles physical address
cycles to addresses in the (on x86) 0xFEExxxxx range, and turns them
into actual interrupts on the appropriate CPU — where the APIC ID and
vector (etc.) are directly encoded in the bits of the address and the
data written. That compatibility x86 APIC MSI format is where the 8-bit
(or 15-bit) limit on the destination APIC ID comes from.
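
To make the encoding concrete, here is a minimal sketch of composing a
compatibility-format message. The names (compose_compat_msi, struct
msi_message, the MSI_ADDR_* macros) are made up for illustration and are
not the kernel's own definitions. It assumes physical destination mode
and fixed, edge-triggered delivery, with APIC ID bits 7:0 in address
bits 19:12 and, when the 15-bit extension is in use, APIC ID bits 14:8
in address bits 11:5.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; not the kernel's own definitions. */
#define MSI_ADDR_BASE        0xFEE00000u  /* the 0xFEExxxxx window */
#define MSI_ADDR_DEST_SHIFT  12           /* APIC ID bits 7:0  -> address bits 19:12 */
#define MSI_ADDR_EXT_SHIFT   5            /* APIC ID bits 14:8 -> address bits 11:5 */

struct msi_message {
    uint32_t address_lo;
    uint32_t data;
};

/*
 * Compose a compatibility-format message (physical destination mode,
 * fixed delivery, edge triggered). Returns 0 on success, or -1 if the
 * APIC ID does not fit in the bits the format provides.
 */
static int compose_compat_msi(uint32_t apic_id, uint8_t vector,
                              int ext_dest_id, struct msi_message *msg)
{
    /* 8 bits of APIC ID normally, 15 with the extended destination ID. */
    uint32_t max_id = ext_dest_id ? 0x7FFFu : 0xFFu;

    if (apic_id > max_id)
        return -1;

    msg->address_lo = MSI_ADDR_BASE
        | ((apic_id & 0xFFu) << MSI_ADDR_DEST_SHIFT)
        | (((apic_id >> 8) & 0x7Fu) << MSI_ADDR_EXT_SHIFT);
    msg->data = vector;

    /* Sanity check: the message must still target the 0xFEE window. */
    assert((msg->address_lo >> 20) == 0xFEEu);
    return 0;
}
```

The point of the sketch is only that the limit falls out of the bit
layout: an APIC ID of 0x100 cannot be expressed at all unless the extra
seven address bits are available.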

Then interrupt remapping comes along, and now those physical address
cycles are actually handled by the IOMMU — which can either handle the
compatibility format as before, or use a different format of
address/data bits and perform a lookup in its IRTE table.
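
As a rough illustration of that split, the sketch below resolves a write
in the 0xFEExxxxx window either directly (compatibility format) or via a
table lookup (remappable format). It is loosely modelled on the Intel
layout, where address bit 4 selects the format and the table index sits
in address bits 19:5 plus bit 2, and it ignores the subhandle entirely.
struct toy_irte, struct delivery and resolve_msi are hypothetical names,
and a real IRTE carries far more state than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_FORMAT_REMAPPABLE (1u << 4)  /* assumed: bit 4 selects the format */

/* Toy remapping table entry; real entries hold much more. */
struct toy_irte {
    int      present;
    uint32_t dest_apic_id;  /* not bound by the 8-bit compat limit */
    uint8_t  vector;
};

struct delivery {
    uint32_t dest_apic_id;
    uint8_t  vector;
};

/* Returns 0 and fills *out on success, -1 if the write is not deliverable. */
static int resolve_msi(uint32_t addr, uint32_t data,
                       const struct toy_irte *table, size_t entries,
                       struct delivery *out)
{
    assert(out != NULL);

    if ((addr >> 20) != 0xFEEu)
        return -1;  /* not an interrupt message at all */

    if (addr & ADDR_FORMAT_REMAPPABLE) {
        /* Remappable format: the address carries an index, not a target. */
        uint32_t index = ((addr >> 5) & 0x7FFFu) | (((addr >> 2) & 1u) << 15);

        if (index >= entries || !table[index].present)
            return -1;
        out->dest_apic_id = table[index].dest_apic_id;
        out->vector = table[index].vector;
        return 0;
    }

    /* Compatibility format: target and vector are in the message itself. */
    out->dest_apic_id = (addr >> 12) & 0xFFu;
    out->vector = (uint8_t)(data & 0xFFu);
    return 0;
}
```

Either way the device that issued the write is none the wiser; only the
thing decoding the address knows which interpretation applies.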

The PCI MSI domain, HPET, and even the IOAPIC are just the things out
there on the bus which might perform those physical address cycles. And
yes, as you say they're just a message store sending exactly the
message that was composed for them. They know absolutely nothing about
what the message means and how it is composed.

It so happens that in Linux, we don't really architect the software
like that. So each of the PCI MSI domain, HPET, and IOAPIC has its
*own* message composer, which has the same limits and composes basically
the same messages as if it were *its own* format, not dictated to it by
the APIC upstream. And that's what we're both getting our panties in a
knot about, I think.

It really doesn't matter that much to the underlying generic irqdomain
support for limited affinities. Except that you want to make the
generic code support the concept of a child domain supporting *more*
CPUs than its parent, which really doesn't make much sense if you think
about it. But it isn't that hard to do, and if it means we don't have
to argue any more about the x86 hierarchy not matching the hardware
then it's a price worth paying. 
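
For what it's worth, here is a toy model of why a child advertising
*more* CPUs than its parent buys nothing. It is not the irqdomain API;
struct toy_domain, max_affinity and effective_affinity are hypothetical,
and it assumes at most 64 CPUs so that a mask fits in one word. The
usable set of targets is simply the intersection of the masks all the
way up the hierarchy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a domain in the hierarchy. */
struct toy_domain {
    const struct toy_domain *parent;  /* NULL at the root */
    uint64_t max_affinity;            /* bit n set: CPU n is reachable */
};

/*
 * Walk from the given domain up to the root, narrowing the requested
 * mask by each level's limit. A level can only remove CPUs, never add
 * ones its parent cannot reach.
 */
static uint64_t effective_affinity(const struct toy_domain *d,
                                   uint64_t requested)
{
    assert(d != NULL);

    for (; d != NULL; d = d->parent)
        requested &= d->max_affinity;
    return requested;
}
```

With a parent limited to CPUs 0-7 and a child claiming every CPU, a
request for CPUs 4-8 still comes back as CPUs 4-7.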
