On Tue, 2020-10-06 at 09:37 +0100, David Woodhouse wrote:
> On Tue, 2020-10-06 at 06:47 +0000, Dexuan Cui wrote:
> > Hi all,
> > I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is
> > v5.9-rc7 and I have CONFIG_NR_CPUS=256.
> >
> > The Hyper-V host (version 17763-10.0-1-0.1457) provides a guest
> > firmware which always reports 128 Local APIC entries in the ACPI MADT
> > table. Only the first Local APIC entry has "Processor Enabled" set to
> > 1, since this Linux VM is configured with only 1 CPU. This means that
> > in the Linux kernel, "cpu_present_mask" and "cpu_online_mask" contain
> > only one CPU (CPU0), while "cpu_possible_mask" contains 128 CPUs and
> > "nr_cpu_ids" is 128.
> >
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU). The code below does *not* report any error
> > (pci_alloc_irq_vectors_affinity() returns 2 and request_irq() returns
> > 0), but it does not work: the second MSI-X interrupt never fires,
> > while the first interrupt works fine.
> >
> > 	int nr_irqs = 2;
> > 	int i, nvec, irq;
> >
> > 	nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> > 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
> >
> > 	for (i = 0; i < nvec; i++) {
> > 		irq = pci_irq_vector(pdev, i);
> > 		err = request_irq(irq, test_intr, 0, "test_intr",
> > 				  &intr_cxt[i]);
> > 	}
> >
> > It turns out that pci_alloc_irq_vectors_affinity() -> ... ->
> > irq_create_affinity_masks() allocates an improper affinity for the
> > second interrupt. The printk() below shows that the second
> > interrupt's affinity is 1-64, but only CPU0 is present in the system!
> > As a result, request_irq() -> ... -> irq_startup() ->
> > __irq_startup_managed() later returns IRQ_STARTUP_ABORT, because
> > cpumask_any_and(aff, cpu_online_mask) finds no CPU (i.e. the result
> > is >= nr_cpu_ids), and irq_startup() *silently* fails (i.e.
> > "return 0;"), since __irq_startup() is only called for
> > IRQ_STARTUP_MANAGED and IRQ_STARTUP_NORMAL.
> >
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -484,6 +484,9 @@ struct irq_affinity_desc *
> >  	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
> >  		masks[i].is_managed = 1;
> >
> > +	for (i = 0; i < nvecs; i++)
> > +		printk("i=%d, affi = %*pbl\n", i,
> > +		       cpumask_pr_args(&masks[i].mask));
> >  	return masks;
> >  }
> >
> > [ 43.770477] i=0, affi = 0,65-127
> > [ 43.770484] i=1, affi = 1-64
> >
> > Though here the issue happens in a Linux VM on Hyper-V, I think the
> > same issue can also happen on a physical machine, if the machine
> > likewise uses a lot of static MADT entries of which only the entries
> > for the present CPUs are marked "Processor Enabled == 1".
> >
> > I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range()
> > -> irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask)
> > should use cpu_present_mask rather than cpu_possible_mask, so that
> > irq_calc_affinity_vectors() would return 1 here and
> > __pci_enable_msix_range() would immediately return -ENOSPC, avoiding
> > the *silent* failure.
> >
> > However, git log shows that this 2018 commit intentionally changed
> > cpu_present_mask to cpu_possible_mask:
> > 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> >
> > so I'm not sure whether (and how?) we should address the *silent*
> > failure.
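
FWIW, until that question is settled, a driver can at least make the failure
visible instead of silent. This is only a rough sketch, not anything from the
test code above; it assumes the same pdev and relies on the existing
pci_irq_get_affinity() helper, and it could be called right after the
allocation/request_irq() sequence quoted above:

#include <linux/pci.h>
#include <linux/cpumask.h>

/*
 * Hypothetical sanity check (not part of the quoted test driver): warn
 * when a vector's affinity mask contains no online CPU, which is the
 * case where irq_startup() silently aborts and the interrupt never fires.
 */
static void test_check_vector_affinity(struct pci_dev *pdev, int nvec)
{
	int i;

	for (i = 0; i < nvec; i++) {
		const struct cpumask *mask = pci_irq_get_affinity(pdev, i);

		if (mask && !cpumask_intersects(mask, cpu_online_mask))
			dev_warn(&pdev->dev,
				 "vector %d (irq %d): no online CPU in %*pbl\n",
				 i, pci_irq_vector(pdev, i),
				 cpumask_pr_args(mask));
	}
}
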
> > BTW, here I use a single-CPU VM to simplify the discussion. Actually,
> > if the VM has n CPUs, then with the above usage of
> > pci_alloc_irq_vectors_affinity() (which might seem incorrect, but my
> > point is that it's really not good to have a silent failure, which
> > makes it a lot more difficult to figure out what goes wrong), it looks
> > like only the first n MSI-X interrupts can work, and the (n+1)'th
> > MSI-X interrupt cannot work due to the improperly allocated affinity.
> >
> > According to my tests, if we need n+1 MSI-X interrupts in such a VM
> > that has n CPUs, it looks like we have 2 options (the second should be
> > better):
> >
> > 1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
> >    pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);
> >
> > 2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd,
> >    which tells the API that we don't care about the first interrupt's
> >    affinity:
> >
> > 	struct irq_affinity affd = {
> > 		.pre_vectors = 1,
> > 		...
> > 	};
> >
> > 	pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
> > 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
> >
> > PS, irq_create_affinity_masks() is complicated. Let me know if you're
> > interested in how it allocates the invalid affinity "1-64" for the
> > second MSI-X interrupt.
>
> Go on. It'll save me a cup of coffee or two...
>
> > PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU
> > VM, so the issue described above cannot be reproduced there.
>
> It seems fairly easy to reproduce in qemu with -smp 1,maxcpus=128 and a
> virtio-blk drive, having commented out the 'desc->pre_vectors++' around
> line 130 of virtio_pci_common.c so that it does actually spread them.
>
> [    0.836252] i=0, affi = 0,65-127
> [    0.836672] i=1, affi = 1-64
> [    0.837905] virtio_blk virtio1: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
> [    0.839080] vda: detected capacity change from 0 to 21474836480
>
> In my build I had to add 'nox2apic' because I think I actually already
> fixed this for the x2apic + no-irq-remapping case with the max_affinity
> patch series¹. But mostly by accident.
>
> ¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/irqaffinity

Is it fixed by
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/41cfe6d54e5?

---
 kernel/irq/affinity.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6d7dbcf91061..00aa0ba6b32a 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -364,12 +364,17 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 	cpumask_copy(npresmsk, cpu_present_mask);
 
 	/* Spread on present CPUs starting from affd->pre_vectors */
-	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, cpu_present_mask,
-					 nmsk, masks);
-	if (ret < 0)
-		goto fail_build_affinity;
-	nr_present = ret;
+	while (nr_present < numvecs) {
+		curvec = firstvec + nr_present;
+		ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
+						 node_to_cpumask, npresmsk,
+						 nmsk, masks);
+		if (ret < 0)
+			goto fail_build_affinity;
+		if (!ret)
+			break;
+		nr_present += ret;
+	}
 
 	/*
 	 * Spread on non present CPUs starting from the next vector to be
-- 
2.17.1
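
For the archives, here is how the driver-side workaround (option 2 in
Dexuan's mail above) fits together end to end. This is only a sketch of
his suggestion, not tested code; test_intr() and intr_cxt[] are the
hypothetical handler/context names from the test snippet quoted at the
top of the thread:

#include <linux/interrupt.h>
#include <linux/pci.h>

/*
 * Allocate n+1 MSI-X vectors on an n-CPU guest, keeping the first
 * vector out of the managed spread via .pre_vectors so that only the
 * remaining n vectors get per-CPU managed affinity.
 *
 * test_intr() and intr_cxt[] come from the test driver quoted above.
 */
static int test_setup_irqs(struct pci_dev *pdev, int nr_irqs)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* don't manage vector 0's affinity */
	};
	int i, nvec, irq, err;

	nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
	if (nvec < 0)
		return nvec;

	for (i = 0; i < nvec; i++) {
		irq = pci_irq_vector(pdev, i);
		err = request_irq(irq, test_intr, 0, "test_intr",
				  &intr_cxt[i]);
		if (err)
			return err;
	}

	return 0;
}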