Re: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?

From: David Woodhouse <dwmw2@infradead.org>
To: Dexuan Cui <decui@microsoft.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Stefan Haberland <sth@linux.vnet.ibm.com>,
	Jens Axboe <axboe@kernel.dk>, Marc Zyngier <marc.zyngier@arm.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: Long Li <longli@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Michael Kelley <mikelley@microsoft.com>
Subject: Re: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?
Date: Tue, 06 Oct 2020 12:17:25 +0100
Message-ID: <077f3399e68fca343c06d1016fd6816fb6a59712.camel@infradead.org>
In-Reply-To: <65ba8a8b86201d8906313fbacc4fb711b9b423af.camel@infradead.org>


On Tue, 2020-10-06 at 09:37 +0100, David Woodhouse wrote:
> On Tue, 2020-10-06 at 06:47 +0000, Dexuan Cui wrote:
> > Hi all,
> > I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7
> > and I have CONFIG_NR_CPUS=256.
> > 
> > The Hyper-V Host (Version 17763-10.0-1-0.1457) provides a guest firmware,
> > which always reports 128 Local APIC entries in the ACPI MADT table. Here
> > only the first Local APIC entry's "Processor Enabled" is 1 since this
> > Linux VM is configured to have only 1 CPU. This means: in the Linux kernel,
> > the "cpu_present_mask" and "cpu_online_mask" have only 1 CPU (i.e. CPU0),
> > while the "cpu_possible_mask" has 128 CPUs, and the "nr_cpu_ids" is 128.
> > 
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU), and the below code does *not* report any error
> > (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> > returns 0), but the code does not work: the second MSI-X interrupt never
> > fires, while the first interrupt works fine.
> > 
> > int nr_irqs = 2;
> > int i, nvec, irq, err;
> > 
> > nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> >                 PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
> > 
> > for (i = 0; i < nvec; i++) {
> >         irq = pci_irq_vector(pdev, i);
> >         err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> > }
> > 
> > It turns out that pci_alloc_irq_vectors_affinity() -> ... ->
> > irq_create_affinity_masks() allocates an improper affinity for the second
> > interrupt. The below printk() shows that the second interrupt's affinity is
> > 1-64, but only CPU0 is present in the system! As a result, later,
> > request_irq() -> ... -> irq_startup() -> __irq_startup_managed() returns
> > IRQ_STARTUP_ABORT because cpumask_any_and(aff, cpu_online_mask) finds no
> > CPU (i.e. returns >= nr_cpu_ids), and irq_startup() *silently* fails (i.e.
> > "return 0;"),
> > since __irq_startup() is only called for IRQ_STARTUP_MANAGED and
> > IRQ_STARTUP_NORMAL.
> > 
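
For illustration, a minimal userspace sketch of the check described above.
It is not kernel code: every toy_* name is made up here, and the model
assumes nr_cpu_ids = 128 with only CPU0 online, as in the report. It only
shows why a mask of 1-64 ends in the abort case while a mask containing
CPU0 does not.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPU_IDS 128	/* nr_cpu_ids from the report */

/* Toy stand-in for struct cpumask: one bit per CPU. */
struct toy_cpumask {
	uint64_t bits[NR_CPU_IDS / 64];
};

static void toy_set_range(struct toy_cpumask *m, unsigned int first,
			  unsigned int last)
{
	assert(first <= last && last < NR_CPU_IDS);
	for (unsigned int cpu = first; cpu <= last; cpu++)
		m->bits[cpu / 64] |= UINT64_C(1) << (cpu % 64);
}

/* Like cpumask_any_and(): a CPU set in both masks, or NR_CPU_IDS if none. */
static unsigned int toy_any_and(const struct toy_cpumask *a,
				const struct toy_cpumask *b)
{
	for (unsigned int cpu = 0; cpu < NR_CPU_IDS; cpu++) {
		uint64_t bit = UINT64_C(1) << (cpu % 64);

		if (a->bits[cpu / 64] & b->bits[cpu / 64] & bit)
			return cpu;
	}
	return NR_CPU_IDS;
}

enum toy_startup { TOY_STARTUP_MANAGED, TOY_STARTUP_ABORT };

/* Models only the managed-interrupt decision, nothing else. */
static enum toy_startup toy_startup_managed(const struct toy_cpumask *aff,
					    const struct toy_cpumask *online)
{
	if (toy_any_and(aff, online) >= NR_CPU_IDS)
		return TOY_STARTUP_ABORT;	/* no online CPU in the mask */
	return TOY_STARTUP_MANAGED;
}

/* Affinity first..last checked against an online mask holding only CPU0. */
enum toy_startup toy_try(unsigned int first, unsigned int last)
{
	struct toy_cpumask aff = { { 0 } }, online = { { 0 } };

	toy_set_range(&aff, first, last);
	toy_set_range(&online, 0, 0);
	return toy_startup_managed(&aff, &online);
}
```

In this model toy_try(1, 64) takes the abort path, and nothing reports
that back to the caller unless the caller checks for it.
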
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -484,6 +484,9 @@ struct irq_affinity_desc *
> >         for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
> >                 masks[i].is_managed = 1;
> > 
> > +       for (i = 0; i < nvecs; i++)
> > +               printk("i=%d, affi = %*pbl\n", i,
> > +                      cpumask_pr_args(&masks[i].mask));
> >         return masks;
> >  }
> > 
> > [   43.770477] i=0, affi = 0,65-127
> > [   43.770484] i=1, affi = 1-64
> > 
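
A rough userspace model that reproduces those two masks may help here.
It is only a sketch under loud assumptions: a single NUMA node, CPUs
0..npresent-1 present, and a two-pass spread (present CPUs first, then the
remaining possible CPUs, continuing from the next vector and wrapping).
All toy_* names are hypothetical and the real irq_create_affinity_masks()
does considerably more.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 128
#define MAX_VECS 8

/* cpu[v][c] is true when vector v may target CPU c. */
struct toy_masks {
	bool cpu[MAX_VECS][NR_CPUS];
};

/*
 * Spread the CPUs marked in 'pool' over at most 'numvecs' vectors, starting
 * at 'curvec' and wrapping to vector 0. Returns the number of vectors used.
 */
static unsigned int toy_spread(const bool *pool, unsigned int curvec,
			       unsigned int numvecs, struct toy_masks *m)
{
	unsigned int ncpus = 0, vecs, per_vec, extra, cpu = 0, done;

	for (unsigned int i = 0; i < NR_CPUS; i++)
		ncpus += pool[i];
	if (!ncpus)
		return 0;

	vecs = ncpus < numvecs ? ncpus : numvecs;
	per_vec = ncpus / vecs;
	extra = ncpus % vecs;	/* the first 'extra' vectors get one more CPU */

	for (done = 0; done < vecs; done++, curvec++) {
		unsigned int want = per_vec + (done < extra ? 1 : 0);

		if (curvec >= numvecs)
			curvec = 0;
		while (want && cpu < NR_CPUS) {
			if (pool[cpu]) {
				m->cpu[curvec][cpu] = true;
				want--;
			}
			cpu++;
		}
	}
	return done;
}

static void toy_build(unsigned int npresent, unsigned int numvecs,
		      struct toy_masks *m)
{
	bool present[NR_CPUS], absent[NR_CPUS];
	unsigned int nr_present, curvec;

	assert(numvecs > 0 && numvecs <= MAX_VECS && npresent <= NR_CPUS);
	memset(m, 0, sizeof(*m));
	for (unsigned int i = 0; i < NR_CPUS; i++) {
		present[i] = i < npresent;
		absent[i] = !present[i];
	}

	/* Pass 1: present CPUs, starting from vector 0. */
	nr_present = toy_spread(present, 0, numvecs, m);
	/* Pass 2: the other possible CPUs, continuing after pass 1. */
	curvec = nr_present >= numvecs ? 0 : nr_present;
	toy_spread(absent, curvec, numvecs, m);
}

/* Convenience query: does vector 'vec' include 'cpu' in this configuration? */
bool toy_vec_has_cpu(unsigned int npresent, unsigned int numvecs,
		     unsigned int vec, unsigned int cpu)
{
	struct toy_masks m;

	assert(vec < numvecs && cpu < NR_CPUS);
	toy_build(npresent, numvecs, &m);
	return m.cpu[vec][cpu];
}
```

With npresent = 1 and numvecs = 2, pass 1 gives CPU0 to vector 0, and
pass 2 starts at vector 1 with the 127 non-present CPUs, so vector 1 ends
up with 1-64 and vector 0 with 0,65-127. Vector 1 never sees a present CPU.
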
> > Though here the issue happens to a Linux VM on Hyper-V, I think the same
> > issue can also happen to a physical machine, if the physical machine also
> > uses a lot of static MADT entries, of which only the entries of the present
> > CPUs are marked to be "Processor Enabled == 1".
> > 
> > I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() ->
> > irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask) should
> > use cpu_present_mask rather than cpu_possible_mask, so here
> > irq_calc_affinity_vectors() would return 1, and
> > __pci_enable_msix_range() would immediately return -ENOSPC to avoid a
> > *silent* failure.
> > 
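
The arithmetic behind that suggestion can be sketched in userspace as
follows. This is a simplified model, not the kernel functions themselves:
the toy_* names are made up, 'resv' stands for pre_vectors + post_vectors,
and 'ncpus' stands for the weight of whichever CPU mask is consulted
(128 for the possible mask, 1 for the present mask in this report).

```c
#include <assert.h>
#include <errno.h>

/*
 * How many vectors to use: the reserved ones plus as many spread vectors
 * as there are CPUs, capped by what the caller asked for.
 */
static unsigned int toy_calc_vectors(unsigned int minvec, unsigned int maxvec,
				     unsigned int resv, unsigned int ncpus)
{
	unsigned int spread;

	if (resv > minvec)
		return 0;

	spread = maxvec - resv;
	if (spread > ncpus)
		spread = ncpus;
	return resv + spread;
}

/* Returns the vector count, or -ENOSPC when fewer than minvec are usable. */
int toy_enable_range(unsigned int minvec, unsigned int maxvec,
		     unsigned int resv, unsigned int ncpus)
{
	unsigned int nvec;

	assert(minvec <= maxvec);
	nvec = toy_calc_vectors(minvec, maxvec, resv, ncpus);
	if (nvec < minvec)
		return -ENOSPC;
	return (int)nvec;
}
```

Under these assumptions, toy_enable_range(2, 2, 0, 128) yields 2 (the
reported behaviour), toy_enable_range(2, 2, 0, 1) yields -ENOSPC (the
proposed behaviour), and toy_enable_range(2, 2, 1, 1) yields 2 again,
because a reserved vector is not counted against the CPUs.
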
> > However, git-log shows that this 2018 commit intentionally changed the
> > cpu_present_mask to cpu_possible_mask:
> > 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> > 
> > so I'm not sure whether (and how?) we should address the *silent* failure.
> > 
> > BTW, here I use a single-CPU VM to simplify the discussion. Actually,
> > if the VM has n CPUs, with the above usage of
> > pci_alloc_irq_vectors_affinity() (which might seem incorrect, but my point is
> > that a silent failure is really not good, since it makes it a lot more
> > difficult to figure out what goes wrong), it looks like only the first n MSI-X
> > interrupts can work, and the (n+1)'th MSI-X interrupt cannot work due to the
> > improper affinity allocated to it.
> > 
> > According to my tests, if we need n+1 MSI-X interrupts in such a VM that
> > has n CPUs, it looks like we have 2 options (the second should be better):
> > 
> > 1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
> >         pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);
> > 
> > 2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd,
> > which tells the API that we don't care about the first interrupt's affinity:
> > 
> >         struct irq_affinity affd = {
> >                 .pre_vectors = 1,
> >                 ...
> >         };
> > 
> >         pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
> >                 PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
> > 
> > PS, irq_create_affinity_masks() is complicated. Let me know if you're
> > interested in how it allocates the invalid affinity "1-64" for the
> > second MSI-X interrupt.
> 
> Go on. It'll save me a cup of coffee or two...
> 
> > PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM,
> > so the issue described above cannot be reproduced there.
> 
> It seems fairly easy to reproduce in qemu with -smp 1,maxcpus=128 and a
> virtio-blk drive, having commented out the 'desc->pre_vectors++' around
> line 130 of virtio_pci_common.c so that it does actually spread them.
> 
> [    0.836252] i=0, affi = 0,65-127
> [    0.836672] i=1, affi = 1-64
> [    0.837905] virtio_blk virtio1: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
> [    0.839080] vda: detected capacity change from 0 to 21474836480
> 
> In my build I had to add 'nox2apic' because I think I actually already
> fixed this for the x2apic + no-irq-remapping case with the max_affinity
> patch series¹. But mostly by accident.
> 
> 
> ¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/irqaffinity

Is it fixed by 
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/41cfe6d54e5?


---
 kernel/irq/affinity.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6d7dbcf91061..00aa0ba6b32a 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -364,12 +364,17 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 		cpumask_copy(npresmsk, cpu_present_mask);
 
 	/* Spread on present CPUs starting from affd->pre_vectors */
-	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, cpu_present_mask,
-					 nmsk, masks);
-	if (ret < 0)
-		goto fail_build_affinity;
-	nr_present = ret;
+	while (nr_present < numvecs) {
+		curvec = firstvec + nr_present;
+		ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
+						 node_to_cpumask, npresmsk,
+						 nmsk, masks);
+		if (ret < 0)
+			goto fail_build_affinity;
+		if (!ret)
+			break;
+		nr_present += ret;
+	}
 
 	/*
 	 * Spread on non present CPUs starting from the next vector to be
-- 
2.17.1


