RE: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?

From: Dexuan Cui <decui@microsoft.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Ming Lei <ming.lei@redhat.com>, Christoph Hellwig <hch@lst.de>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Stefan Haberland <sth@linux.vnet.ibm.com>,
	Jens Axboe <axboe@kernel.dk>, Marc Zyngier <marc.zyngier@arm.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: Long Li <longli@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Michael Kelley <mikelley@microsoft.com>
Subject: RE: irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?
Date: Wed, 7 Oct 2020 03:08:09 +0000	[thread overview]
Message-ID: <KU1P153MB0120B72473E50B1B723E25E6BF0A0@KU1P153MB0120.APCP153.PROD.OUTLOOK.COM> (raw)
In-Reply-To: <87lfgj6v30.fsf@nanos.tec.linutronix.de>

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, October 6, 2020 11:58 AM
> > ...
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU), and the below code does *not* report any error
> > (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> > returns 0), but the code does not work: the second MSI-X interrupt is not
> > happening while the first interrupt does work fine.
> >
> > int nr_irqs = 2;
> > int i, nvec, irq;
> >
> > nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> >                 PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
> 
> Why should it return an error?

The above code returns -ENOSPC if num_possible_cpus() is also 1, and
returns 0 if num_possible_cpus() is 128. So it looks the above code is
not using the API correctly, and hence gets undefined results.

> > for (i = 0; i < nvec; i++) {
> >         irq = pci_irq_vector(pdev, i);
> >         err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> > }
> 
> And why do you expect that the second interrupt works?
> 
> This is about managed interrupts and the spreading code has two vectors
> to which it can spread the interrupts. One is assigned to one half of
> the possible CPUs and the other one to the other half. Now you have only
> one CPU online so only the interrupt with has the online CPU in the
> assigned affinity mask is started up.
> 
> That's how managed interrupts work. If you don't want managed interrupts
> then don't use them.
> 
> Thanks,
> 
>         tglx

Thanks for the clarification! It looks with PCI_IRQ_AFFINITY the kernel 
guarantees that the allocated interrutps are 1:1 bound to CPUs, and the
userspace is unable to change the affinities. This is very useful to support
per-CPU I/O queues.

Thanks,
-- Dexuan