From: David Woodhouse <dwmw2@infradead.org>
To: Thomas Gleixner <tglx@linutronix.de>, x86@kernel.org
Cc: iommu <iommu@lists.linux-foundation.org>,
	kvm <kvm@vger.kernel.org>,
	linux-hyperv@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
Date: Thu, 08 Oct 2020 08:21:17 +0100	[thread overview]
Message-ID: <119c2f993cac5d57c54d4720addc9f32bf1daadd.camel@infradead.org> (raw)
In-Reply-To: <87a6wy3u6n.fsf@nanos.tec.linutronix.de>

On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
> TLDR & HTF;
> 
> Multiqueue devices want to have at max 1 queue per CPU or if the device
> has less queues than CPUs they want the queues to have a fixed
> association to a set of CPUs.
> 
> At setup time this is established considering possible CPUs to handle
> 'physical' hotplug correctly.
> 
> If a queue has no online CPUs it cannot be started. If it's active and
> the last CPU goes down then it's quiesced and stopped and the core code
> shuts down the interrupt and does not move it to a still online CPU.
> 
> So with your hackery, we end up in a situation where we have a large
> possible mask, but not all CPUs in that mask can be reached, which means
> in a 1 queue per CPU scenario all unreachable CPUs would have
> disfunctional queues.
> 
> So that spreading algorithm needs to know about this limitation.

OK, thanks. So the queue exists, with an MSI assigned that points at
currently offline CPU(s), but it cannot actually be used until/unless
at least one CPU in its mask comes online.

So when I said I wanted to try treating "reachable" the same way as
"online", that would mean the queue can't start until/unless at least
one *reachable* CPU in its mask comes online.
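
To make that concrete, roughly what I have in mind is something like
this (only a sketch, not code from the series; "reachable" stands for
whichever cpumask the irqdomain ends up exposing, and
queue_startable() is an invented name):

#include <linux/cpumask.h>

/*
 * Sketch only: a queue may be started if at least one CPU in its
 * affinity mask is both online and reachable by its MSI. The
 * "reachable" mask and queue_startable() are illustrative names,
 * not anything from the actual patches.
 */
static bool queue_startable(const struct cpumask *affinity,
			    const struct cpumask *reachable)
{
	unsigned int cpu;

	for_each_cpu_and(cpu, affinity, cpu_online_mask) {
		if (cpumask_test_cpu(cpu, reachable))
			return true;
	}
	return false;
}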


The underlying problem here is that until a CPU comes online, we don't
actually *know* if it's reachable or not.

So if we want to carefully create the affinity masks at setup time so
that they don't include any unreachable CPUs... that basically means
we don't include any non-present CPUs at all (unless they've been
added once and then removed).

That's because — at least on x86 — we don't assign CPU numbers to CPUs
which haven't yet been added. Theoretically we *could* but if there are
more than NR_CPUS listed in (e.g.) the MADT then we could run out of
CPU numbers. Then if the admin hotplugs one of the CPUs we *didn't*
have space for, we'd not be able to handle it.
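
In mask-building terms that approach is basically just this (again a
sketch only; the helper name is invented):

#include <linux/cpumask.h>

/*
 * Sketch: the "only CPUs we already know about" variant. Absent CPUs
 * have no CPU number yet, so restricting the prebuilt affinity to
 * cpu_present_mask is the best we can do at setup time. Helper name
 * is invented for illustration.
 */
static void restrict_affinity_to_present(struct cpumask *affinity)
{
	cpumask_and(affinity, affinity, cpu_present_mask);
}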

I suppose there might be options like pre-assigning CPU numbers only to
non-present APIC IDs below 256 (or 32768). Or *grouping* the CPU
numbers, so that some not-yet-assigned CPU#s are reserved for low
APIC IDs and some for high APIC IDs, without it actually having to be
a 1:1 predetermined mapping.

But those really do seem like hacks which might only apply on x86,
while the generic approach of treating "reachable" like "online" seems
like it would work in other cases too.

Fundamentally, there are three sets of CPUs. There are those known to
be reachable, those known not to be, and those which are not yet known.

So another approach we could use is to work with a cpumask of those
*known* not to be reachable, and to filter those *out* of the prebuilt
affinities. That gives us basically the right behaviour without
hotplug, but it does include absent CPUs in the mask which, *if* they
are ever added, wouldn't be able to receive the IRQ. Which does mean
we'd have to refrain from bringing up the corresponding queue.
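
i.e. something along these lines (sketch again; known_unreachable_mask
is an invented name for whatever mask we'd maintain):

#include <linux/cpumask.h>

/*
 * Sketch: maintain a mask of CPUs *known* to be unreachable and strip
 * it from the prebuilt affinity. Not-yet-present CPUs are left in, so
 * if one of them is later added and turns out to be unreachable, the
 * corresponding queue must not be brought up. All names invented.
 */
static struct cpumask known_unreachable_mask;

static void filter_known_unreachable(struct cpumask *affinity)
{
	cpumask_andnot(affinity, affinity, &known_unreachable_mask);
}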

