Linux-HyperV Archive on lore.kernel.org
 help / color / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: David Woodhouse <dwmw2@infradead.org>, x86@kernel.org
Cc: iommu <iommu@lists.linux-foundation.org>,
	kvm <kvm@vger.kernel.org>,
	linux-hyperv@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
Date: Thu, 08 Oct 2020 11:34:41 +0200
Message-ID: <87k0w12h8u.fsf@nanos.tec.linutronix.de> (raw)
In-Reply-To: <119c2f993cac5d57c54d4720addc9f32bf1daadd.camel@infradead.org>

On Thu, Oct 08 2020 at 08:21, David Woodhouse wrote:
> On Wed, 2020-10-07 at 17:57 +0200, Thomas Gleixner wrote:
>> Multiqueue devices want to have at max 1 queue per CPU or if the device
>> has less queues than CPUs they want the queues to have a fixed
>> association to a set of CPUs.
>> 
>> At setup time this is established considering possible CPUs to handle
>> 'physical' hotplug correctly.
>> 
>> If a queue has no online CPUs it cannot be started. If it's active and
>> the last CPU goes down then it's quiesced and stopped and the core code
>> shuts down the interrupt and does not move it to a still online CPU.
>> 
>> So with your hackery, we end up in a situation where we have a large
>> possible mask, but not all CPUs in that mask can be reached, which means
>> in a 1 queue per CPU scenario all unreachable CPUs would have
>> disfunctional queues.
>> 
>> So that spreading algorithm needs to know about this limitation.
>
> OK, thanks. So the queue exists, with an MSI assigned to point to an
> offline CPU(s), but it cannot actually be used until/unless at least
> one CPU in its mask comes online.

The MSI entry in that case is actually directed to an online CPU's
MANAGED_IRQ_SHUTDOWN_VECTOR to catch cases where an interrupt is raised
by the device after shutdown.

> So when I said I wanted to try treating "reachable" the same way as
> "online", that would mean the queue can't start until/unless at least
> one *reachable* CPU in its mask comes online.
>
> The underlying problem here is that until a CPU comes online, we don't
> actually *know* if it's reachable or not.

It's known before online, i.e. when the CPU is registered which is
either at boot time for present CPUs or at 'physical' hotplug.

> So if we want carefully create the affinity masks at setup time so that
> they don't include any unreachable CPUs... that basically means we
> don't include any non-present CPUs at all (unless they've been added
> once and then removed).

That breaks _all_ multi-queue assumptions in one go. :)

> But those really do seem like hacks which might only apply on x86,
> while the generic approach of treating "reachable" like "online" seems
> like it would work in other cases too.
>
> Fundamentally, there are three sets of CPUs. There are those known to
> be reachable, those known not to be, and those which are not yet
> known.

Unfortunately there are lots of assumptions all over the place that
possible CPUs are reachable. Multi-queue using managed interrupts is
just the tip of the iceberg.

> So another approach we could use is to work with a cpumask of those
> *known* not to be reachable, and to filter those *out* of the prebuilt
> affinities. That gives us basically the right behaviour without
> hotplug, but does include absent CPUs in a mask that *if* they are ever
> added, wouldn't be able to receive the IRQ. Which does mean we'd have
> to refrain from bringing up the corresponding queue. 

The multi-queue drivers rely on the interrupt setup to create their
queues and the fundamental assumption is that this setup works. The
managed interrupt mechanism guarantees that the queue has a vector
available on all CPUs which are in the queues assigned affinity mask. As
of today it also guarantees that these CPUs are reachable once they come
online.

So in order to make that work you'd need to teach the multi-queue stuff
about this new world order:

 1) On hotplug the queue needs to be able to figure out whether the
    interrupt is functional. If not it has to redirect any requests to
    some actually functional queue.

 2) On unplug it needs to be able to figure out whether the interrupt
    will shutdown because the outgoing CPU is the last reachable in the
    group and if there are still online but unreachable CPUs then use
    the redirect mechanism.

I'm sure that the multi-queue people will be enthusiastic to add all of
this and deal with all the nasty corner cases coming out of it.

The overall conclusion for this is:

 1) X2APIC support on bare metal w/o irq remapping is not going to
    happen unless you:

      - added support in multi-queue devices which utilize managed
        interrupts
        
      - audited the whole tree for other assumptions related to the
        reachability of possible CPUs.

    I'm not expecting you to be done with that before I retire so for
    me it's just not going to happen :)

 2) X2APIC support on VIRT is possible if the extended ID magic is
    supported by the hypervisor because that does not make any CPU
    unreachable for MSI and therefore the multi-queue muck and
    everything else just works.

    This requires to have either the domain affinity limitation for HPET
    in place or just to force disable HPET or at least HPET-MSI which is
    a reasonable tradeoff.

    HPET is not required for guests which have kvmclock and
    APIC/deadline timer and known (hypervisor provided) frequencies.

Anything else is just wishful thinking, really.

Thanks,

        tglx

  reply index

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-05 15:28 [PATCH 0/13] Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping David Woodhouse
2020-10-05 15:28 ` [PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs David Woodhouse
2020-10-05 15:28   ` [PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit David Woodhouse
2020-10-06 20:45     ` Thomas Gleixner
2020-10-05 15:28   ` [PATCH 03/13] x86/ioapic: Handle Extended Destination ID field in RTE David Woodhouse
2020-10-05 15:28   ` [PATCH 04/13] x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available David Woodhouse
2020-10-05 15:28   ` [PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs() David Woodhouse
2020-10-06 21:01     ` Thomas Gleixner
2020-10-06 21:07       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 06/13] genirq: Add default_affinity argument " David Woodhouse
2020-10-06 21:06     ` Thomas Gleixner
2020-10-05 15:28   ` [PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs() David Woodhouse
2020-10-06 21:26     ` Thomas Gleixner
2020-10-07  7:19       ` David Woodhouse
2020-10-07 13:37         ` Thomas Gleixner
2020-10-07 14:10           ` David Woodhouse
2020-10-07 15:57             ` Thomas Gleixner
2020-10-07 16:11               ` David Woodhouse
2020-10-07 20:53                 ` Thomas Gleixner
2020-10-08  7:21               ` David Woodhouse
2020-10-08  9:34                 ` Thomas Gleixner [this message]
2020-10-08 11:10                   ` David Woodhouse
2020-10-08 12:40                     ` Thomas Gleixner
2020-10-09  7:54                       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 08/13] genirq: Add irq_domain_set_affinity() David Woodhouse
2020-10-06 21:32     ` Thomas Gleixner
2020-10-07  7:22       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 09/13] x86/irq: Add x86_non_ir_cpumask David Woodhouse
2020-10-06 21:42     ` Thomas Gleixner
2020-10-07  7:25       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR David Woodhouse
2020-10-06 21:54     ` Thomas Gleixner
2020-10-07  7:48       ` David Woodhouse
2020-10-07 12:59         ` Thomas Gleixner
2020-10-07 13:08           ` David Woodhouse
2020-10-07 14:05             ` Thomas Gleixner
2020-10-07 14:23               ` David Woodhouse
2020-10-07 16:02                 ` Thomas Gleixner
2020-10-07 16:15                   ` David Woodhouse
2020-10-07 15:05               ` David Woodhouse
2020-10-07 15:25                 ` Thomas Gleixner
2020-10-07 15:46                   ` David Woodhouse
2020-10-07 17:23                     ` Thomas Gleixner
2020-10-07 17:34                       ` David Woodhouse
2020-10-05 15:28   ` [PATCH 11/13] x86/smp: Allow more than 255 CPUs even without interrupt remapping David Woodhouse
2020-10-05 15:28   ` [PATCH 12/13] iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant David Woodhouse
2020-10-05 15:28   ` [PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID David Woodhouse
2020-10-07  8:14     ` Paolo Bonzini
2020-10-07  8:59       ` David Woodhouse
2020-10-07 11:15         ` Paolo Bonzini
2020-10-07 12:04           ` David Woodhouse

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87k0w12h8u.fsf@nanos.tec.linutronix.de \
    --to=tglx@linutronix.de \
    --cc=dwmw2@infradead.org \
    --cc=iommu@lists.linux-foundation.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-HyperV Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-hyperv/0 linux-hyperv/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-hyperv linux-hyperv/ https://lore.kernel.org/linux-hyperv \
		linux-hyperv@vger.kernel.org
	public-inbox-index linux-hyperv

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-hyperv


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git