On 10/12/20 12:58 PM, Bjorn Helgaas wrote:
> [+cc Christoph, Thomas, Nitesh]
>
> On Mon, Oct 12, 2020 at 09:49:37AM -0600, Chris Friesen wrote:
>> I've got a linux system running the RT kernel with threaded irqs.  On
>> startup we affine the various irq threads to the housekeeping CPUs, but I
>> recently hit a scenario where after some days of uptime we ended up with a
>> number of NVME irq threads affined to application cores instead (not good
>> when we're trying to run low-latency applications).
>
> pci_alloc_irq_vectors_affinity() basically just passes affinity
> information through to kernel/irq/affinity.c, and the PCI core doesn't
> change affinity after that.
>
>> Looking at the code, it appears that the NVME driver can in some scenarios
>> end up calling pci_alloc_irq_vectors_affinity() after initial system
>> startup, which seems to determine CPU affinity without any regard for things
>> like "isolcpus" or "cset shield".
>>
>> There seem to be other reports of similar issues:
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1831566
>>
>> It looks like some SCSI drivers and virtio_pci_common.c will also call
>> pci_alloc_irq_vectors_affinity(), though I'm not sure if they would ever do
>> it after system startup.
>>
>> How does it make sense for the PCI subsystem to affine interrupts to CPUs
>> which have explicitly been designated as "isolated"?
>
> This recent thread may be useful:
>
> https://lore.kernel.org/linux-pci/20200928183529.471328-1-nitesh@redhat.com/
>
> It contains a patch to "Limit pci_alloc_irq_vectors() to housekeeping
> CPUs". I'm not sure that patch summary is 100% accurate because IIUC
> that particular patch only reduces the *number* of vectors allocated
> and does not actually *limit* them to housekeeping CPUs.

That is correct; the above-mentioned patch only reduces the number of vectors.
Based on the problem described here, I think the issue could be the use of cpu_online_mask/cpu_possible_mask while creating the affinity mask or while distributing the vectors across CPUs. What we should be doing in these cases is use the housekeeping_cpumask instead. A few months back a similar issue was fixed for cpumask_local_spread and some other sub-systems [1].

[1] https://lore.kernel.org/lkml/20200625223443.2684-1-nitesh@redhat.com/

--
Nitesh