From: David Woodhouse <dwmw2@infradead.org>
To: Dexuan Cui <decui@microsoft.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Stefan Haberland <sth@linux.vnet.ibm.com>,
	Jens Axboe <axboe@kernel.dk>, Marc Zyngier <marc.zyngier@arm.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: Long Li <longli@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Michael Kelley <mikelley@microsoft.com>
Subject: Re: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?
Date: Tue, 06 Oct 2020 12:17:25 +0100
Message-ID: <077f3399e68fca343c06d1016fd6816fb6a59712.camel@infradead.org>
In-Reply-To: <65ba8a8b86201d8906313fbacc4fb711b9b423af.camel@infradead.org>


On Tue, 2020-10-06 at 09:37 +0100, David Woodhouse wrote:
> On Tue, 2020-10-06 at 06:47 +0000, Dexuan Cui wrote:
> > Hi all,
> > I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7
> > and I have CONFIG_NR_CPUS=256.
> > 
> > The Hyper-V Host (Version 17763-10.0-1-0.1457) provides a guest firmware,
> > which always reports 128 Local APIC entries in the ACPI MADT table. Here
> > only the first Local APIC entry's "Processor Enabled" is 1 since this
> > Linux VM is configured to have only 1 CPU. This means: in the Linux kernel,
> > the "cpu_present_mask" and "cpu_online_mask" have only 1 CPU (i.e. CPU0),
> > while the "cpu_possible_mask" has 128 CPUs, and the "nr_cpu_ids" is 128.
> > 
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU), and the below code does *not* report any error
> > (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> > returns 0), but the code does not work: the second MSI-X interrupt is not
> > happening while the first interrupt does work fine.
> > 
> > int nr_irqs = 2;
> > int i, nvec, irq, err;
> > 
> > nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> >                 PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
> > 
> > for (i = 0; i < nvec; i++) {
> >         irq = pci_irq_vector(pdev, i);
> >         err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> > }
> > 
> > It turns out that pci_alloc_irq_vectors_affinity() -> ... ->
> > irq_create_affinity_masks() allocates an improper affinity for the second
> > interrupt. The below printk() shows that the second interrupt's affinity is
> > 1-64, but only CPU0 is present in the system! As a result, later,
> > request_irq() -> ... -> irq_startup() -> __irq_startup_managed() returns
> > IRQ_STARTUP_ABORT because cpumask_any_and(aff, cpu_online_mask) finds no
> > CPU (i.e. returns >= nr_cpu_ids), and irq_startup() *silently* fails (i.e. "return 0;"),
> > since __irq_startup() is only called for IRQ_STARTUP_MANAGED and
> > IRQ_STARTUP_NORMAL.
> > 
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -484,6 +484,9 @@ struct irq_affinity_desc *
> >         for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
> >                 masks[i].is_managed = 1;
> > 
> > +       for (i = 0; i < nvecs; i++)
> > +               printk("i=%d, affi = %*pbl\n", i,
> > +                      cpumask_pr_args(&masks[i].mask));
> >         return masks;
> >  }
> > 
> > [   43.770477] i=0, affi = 0,65-127
> > [   43.770484] i=1, affi = 1-64
> > 
> > Though here the issue happens to a Linux VM on Hyper-V, I think the same
> > issue can also happen to a physical machine, if the physical machine also
> > uses a lot of static MADT entries, of which only the entries of the present
> > CPUs are marked to be "Processor Enabled == 1".
> > 
> > I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() ->
> > irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask) should
> > use cpu_present_mask rather than cpu_possible_mask, so here
> > irq_calc_affinity_vectors() would return 1, and
> > __pci_enable_msix_range() would immediately return -ENOSPC to avoid a
> > *silent* failure.
> > 
> > However, git-log shows that this 2018 commit intentionally changed the
> > cpu_present_mask to cpu_possible_mask:
> > 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> > 
> > so I'm not sure whether (and how?) we should address the *silent* failure.
> > 
> > BTW, here I use a single-CPU VM to simplify the discussion. Actually,
> > if the VM has n CPUs, then with the above usage of
> > pci_alloc_irq_vectors_affinity() it looks like only the first n MSI-X
> > interrupts work, and the (n+1)'th MSI-X interrupt cannot work because of
> > the improper affinity allocated to it. That usage might seem incorrect,
> > but my point is that a silent failure is really not good: it makes it a
> > lot more difficult to figure out what went wrong.
> > 
> > According to my tests, if we need n+1 MSI-X interrupts in such a VM that
> > has n CPUs, it looks like we have 2 options (the second should be better):
> > 
> > 1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
> >         pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);
> > 
> > 2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd,
> > which tells the API that we don't care about the first interrupt's affinity:
> > 
> >         struct irq_affinity affd = {
> >                 .pre_vectors = 1,
> > 				...
> >         };
> > 
> >         pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
> >                 PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
> > 
> > PS, irq_create_affinity_masks() is complicated. Let me know if you're
> > interested to know how it allocates the invalid affinity "1-64" for the
> > second MSI-X interrupt.
> 
> Go on. It'll save me a cup of coffee or two...
> 
> > PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM,
> > so the issue described above cannot be reproduced there.
> 
> It seems fairly easy to reproduce in qemu with -smp 1,maxcpus=128 and a
> virtio-blk drive, having commented out the 'desc->pre_vectors++' around
> line 130 of virtio_pci_common.c so that it does actually spread them.
> 
> [    0.836252] i=0, affi = 0,65-127
> [    0.836672] i=1, affi = 1-64
> [    0.837905] virtio_blk virtio1: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
> [    0.839080] vda: detected capacity change from 0 to 21474836480
> 
> In my build I had to add 'nox2apic' because I think I actually already
> fixed this for the x2apic + no-irq-remapping case with the max_affinity
> patch series¹. But mostly by accident.
> 
> 
> ¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/irqaffinity
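
For anyone following along, the "0,65-127" / "1-64" split quoted above can
be reproduced with a small userspace sketch. This is a toy model, not the
kernel code: toy_spread(), toy_build_masks() and toy_vector_startable() are
made-up names, NUMA nodes and SMT siblings are ignored, and the present
(and online) CPUs are assumed to be 0..n-1 out of 128 possible. It spreads
the present CPUs first and then the remaining possible CPUs starting from
the next vector, as the two stages in irq_build_affinity_masks() do.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_POSSIBLE 128	/* possible CPUs, as in the report */
#define MAX_VECS    16	/* arbitrary limit for this sketch */

struct vec_mask {
	bool cpu[NR_POSSIBLE];
};

/*
 * Toy stand-in for __irq_build_affinity_masks(): spread the CPUs in
 * 'set' evenly over at most 'numvecs' vectors, starting at 'curvec'
 * and wrapping back to vector 0. Returns how many vectors got CPUs.
 */
static int toy_spread(const bool *set, int curvec, int numvecs,
		      struct vec_mask *masks)
{
	int ncpus = 0, used, per_vec, extra, cpu = 0, v, n;

	for (n = 0; n < NR_POSSIBLE; n++)
		ncpus += set[n];
	if (!ncpus)
		return 0;

	used = ncpus < numvecs ? ncpus : numvecs;
	per_vec = ncpus / used;
	extra = ncpus % used;

	for (v = 0; v < used; v++) {
		/* The first 'extra' vectors take one CPU more. */
		int want = per_vec + (v < extra ? 1 : 0);

		if (curvec >= numvecs)
			curvec = 0;
		while (want > 0) {
			if (set[cpu]) {
				masks[curvec].cpu[cpu] = true;
				want--;
			}
			cpu++;
		}
		curvec++;
	}
	return used;
}

/* Two-stage spread: present CPUs first, then all other possible CPUs. */
static void toy_build_masks(int nr_present, int numvecs,
			    struct vec_mask *masks)
{
	bool present[NR_POSSIBLE], absent[NR_POSSIBLE];
	int n, done;

	assert(nr_present > 0 && nr_present <= NR_POSSIBLE);
	assert(numvecs > 0 && numvecs <= MAX_VECS);

	memset(masks, 0, sizeof(*masks) * (size_t)numvecs);
	for (n = 0; n < NR_POSSIBLE; n++) {
		present[n] = n < nr_present;
		absent[n] = !present[n];
	}

	done = toy_spread(present, 0, numvecs, masks);
	/* Non-present CPUs start from the next vector to be handled. */
	toy_spread(absent, done >= numvecs ? 0 : done, numvecs, masks);
}

/*
 * Mirrors the cpumask_any_and(aff, cpu_online_mask) check, with
 * CPUs 0..nr_online-1 assumed to be both present and online.
 */
static bool toy_vector_startable(int nr_online, int numvecs, int vec)
{
	struct vec_mask masks[MAX_VECS];
	int n;

	toy_build_masks(nr_online, numvecs, masks);
	for (n = 0; n < nr_online; n++)
		if (masks[vec].cpu[n])
			return true;
	return false;
}
```

With one online CPU and two vectors, toy_vector_startable(1, 2, 0) is true
and toy_vector_startable(1, 2, 1) is false: vector 1 ends up with CPUs 1-64
only. With two CPUs and three vectors, it is the third vector that has no
online CPU in its mask.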

Is it fixed by 
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/41cfe6d54e5?


---
 kernel/irq/affinity.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6d7dbcf91061..00aa0ba6b32a 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -364,12 +364,17 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 		cpumask_copy(npresmsk, cpu_present_mask);
 
 	/* Spread on present CPUs starting from affd->pre_vectors */
-	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, cpu_present_mask,
-					 nmsk, masks);
-	if (ret < 0)
-		goto fail_build_affinity;
-	nr_present = ret;
+	while (nr_present < numvecs) {
+		curvec = firstvec + nr_present;
+		ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
+						 node_to_cpumask, npresmsk,
+						 nmsk, masks);
+		if (ret < 0)
+			goto fail_build_affinity;
+		if (!ret)
+			break;
+		nr_present += ret;
+	}
 
 	/*
 	 * Spread on non present CPUs starting from the next vector to be
-- 
2.17.1
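
The idea of the loop in the patch above can be illustrated in userspace. A
rough sketch under stated assumptions: toy_spread_present() and
toy_uncovered() are made-up names, NUMA nodes are ignored, the present CPUs
are numbered 0..n-1 (at most 64 of them), and only the present-CPU stage is
modelled, not the later spread over non-present CPUs. It counts how many
vectors are left without any present CPU after a single pass versus
repeated passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_VECS 16	/* arbitrary limit for this sketch */

/*
 * Toy stand-in for one __irq_build_affinity_masks() pass over the
 * present CPUs: hand them out evenly to at most 'numvecs' vectors,
 * starting at 'curvec' and wrapping back to vector 0. Returns the
 * number of vectors that received CPUs in this pass.
 */
static int toy_spread_present(int nr_present, int curvec, int numvecs,
			      uint64_t *masks)
{
	int used = nr_present < numvecs ? nr_present : numvecs;
	int per_vec = nr_present / used, extra = nr_present % used;
	int cpu = 0, v, n;

	for (v = 0; v < used; v++) {
		/* The first 'extra' vectors take one CPU more. */
		int want = per_vec + (v < extra ? 1 : 0);

		if (curvec >= numvecs)
			curvec = 0;
		for (n = 0; n < want; n++)
			masks[curvec] |= UINT64_C(1) << cpu++;
		curvec++;
	}
	return used;
}

/*
 * Count vectors whose mask holds no present CPU. With 'loop' false a
 * single pass is made; with 'loop' true the pass is repeated from the
 * next unhandled vector until every vector has been visited.
 */
static int toy_uncovered(int nr_present, int numvecs, bool loop)
{
	uint64_t masks[MAX_VECS] = { 0 };
	int done = 0, ret, v, uncovered = 0;

	assert(nr_present > 0 && nr_present <= 64);
	assert(numvecs > 0 && numvecs <= MAX_VECS);

	do {
		ret = toy_spread_present(nr_present, done, numvecs, masks);
		if (!ret)
			break;
		done += ret;
	} while (loop && done < numvecs);

	for (v = 0; v < numvecs; v++)
		uncovered += masks[v] == 0;
	return uncovered;
}
```

In this toy model, toy_uncovered(1, 2, false) is 1 and
toy_uncovered(1, 2, true) is 0. Whether the real patch behaves the same way
on a given topology still needs testing.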


