linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Dexuan Cui <decui@microsoft.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Stefan Haberland <sth@linux.vnet.ibm.com>,
	Jens Axboe <axboe@kernel.dk>, Marc Zyngier <marc.zyngier@arm.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: Long Li <longli@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Michael Kelley <mikelley@microsoft.com>
Subject: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?
Date: Tue, 6 Oct 2020 06:47:45 +0000	[thread overview]
Message-ID: <KU1P153MB0120D20BC6ED8CF54168EEE6BF0D0@KU1P153MB0120.APCP153.PROD.OUTLOOK.COM> (raw)

Hi all,
I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7
and I have CONFIG_NR_CPUS=256.

The Hyper-V Host (Version 17763-10.0-1-0.1457) provides a guest firmware,
which always reports 128 Local APIC entries in the ACPI MADT table. Here
only the first Local APIC entry's "Processor Enabled" is 1 since this
Linux VM is configured to have only 1 CPU. This means: in the Linux kernel,
the "cpu_present_mask" and " cpu_online_mask " have only 1 CPU (i.e. CPU0),
while the "cpu_possible_mask" has 128 CPUs, and the "nr_cpu_ids" is 128.

I pass through an MSI-X-capable PCI device to the Linux VM (which has
only 1 virtual CPU), and the below code does *not* report any error
(i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
returns 0), but the code does not work: the second MSI-X interrupt is not
happening while the first interrupt does work fine.

int nr_irqs = 2;
int i, nvec, irq;

nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
                PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);

for (i = 0; i < nvec; i++) {
        irq = pci_irq_vector(pdev, i);
        err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
}

It turns out that pci_alloc_irq_vectors_affinity() -> ... ->
irq_create_affinity_masks() allocates an improper affinity for the second
interrupt. The below printk() shows that the second interrupt's affinity is
1-64, but only CPU0 is present in the system! As a result, later,
request_irq() -> ... -> irq_startup() -> __irq_startup_managed() returns
IRQ_STARTUP_ABORT because cpumask_any_and(aff, cpu_online_mask) is 
empty (i.e. >= nr_cpu_ids), and irq_startup() *silently* fails (i.e. "return 0;"),
since __irq_startup() is only called for IRQ_STARTUP_MANAGED and
IRQ_STARTUP_NORMAL.

--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -484,6 +484,9 @@ struct irq_affinity_desc *
        for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
                masks[i].is_managed = 1;

+       for (i = 0; i < nvecs; i++)
+               printk("i=%d, affi = %*pbl\n", i,
+                      cpumask_pr_args(&masks[i].mask));
        return masks;
 }

[   43.770477] i=0, affi = 0,65-127
[   43.770484] i=1, affi = 1-64

Though here the issue happens to a Linux VM on Hyper-V, I think the same
issue can also happen to a physical machine, if the physical machine also
uses a lot of static MADT entries, of which only the entries of the present
CPUs are marked to be "Processor Enabled == 1".

I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() ->
irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask) should
use cpu_present_mask rather than cpu_possible_mask (), so here
irq_calc_affinity_vectors() would return 1, and
__pci_enable_msix_range() would immediately return -ENOSPC to avoid a
*silent* failure.

However, git-log shows that this 2018 commit intentionally changed the
cpu_present_mask to cpu_possible_mask:
84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")

so I'm not sure whether (and how?) we should address the *silent* failure.

BTW, here I use a single-CPU VM to simplify the discussion. Actually,
if the VM has n CPUs, with the above usage of
pci_alloc_irq_vectors_affinity() (which might seem incorrect, but my point is
that it's really not good to have a silent failure, which makes it a lot more 
difficult to figure out what goes wrong), it looks only the first n MSI-X interrupts
can work, and the (n+1)'th MSI-X interrupt can not work due to the allocated
improper affinity.

According to my tests, if we need n+1 MSI-X interrupts in such a VM that
has n CPUs, it looks we have 2 options (the second should be better):

1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
        pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);

2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd,
which tells the API that we don't care about the first interrupt's affinity:

        struct irq_affinity affd = {
                .pre_vectors = 1,
				...
        };

        pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
                PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);

PS, irq_create_affinity_masks() is complicated. Let me know if you're
interested to know how it allocates the invalid affinity "1-64" for the
second MSI-X interrupt.

PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM,
so the issue described above can not reproduce there.

Thanks,
-- Dexuan

             reply	other threads:[~2020-10-06  6:48 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-06  6:47 Dexuan Cui [this message]
2020-10-06  8:37 ` irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()? David Woodhouse
2020-10-06 11:17   ` David Woodhouse
2020-10-06 19:00   ` Thomas Gleixner
2020-10-06 18:57 ` Thomas Gleixner
2020-10-07  3:08   ` Dexuan Cui

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=KU1P153MB0120D20BC6ED8CF54168EEE6BF0D0@KU1P153MB0120.APCP153.PROD.OUTLOOK.COM \
    --to=decui@microsoft.com \
    --cc=axboe@kernel.dk \
    --cc=borntraeger@de.ibm.com \
    --cc=haiyangz@microsoft.com \
    --cc=hch@lst.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=longli@microsoft.com \
    --cc=marc.zyngier@arm.com \
    --cc=mikelley@microsoft.com \
    --cc=ming.lei@redhat.com \
    --cc=sth@linux.vnet.ibm.com \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).