From: Matthew Ruffell <matthew.ruffell@canonical.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: linux-pci@vger.kernel.org, lkml <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org,
	nathan.langford@xcelesunifiedtechnologies.com
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio
Date: Mon, 1 Nov 2021 17:35:04 +1300
Message-ID: <CAKAwkKsoKELnR=--06sRZL3S6_rQVi5J_Kcv6iRQ6w2tY71WCQ@mail.gmail.com>
In-Reply-To: <CAKAwkKsF3Kn1HLAg55cBVmPmo2y0QAf7g6Zc7q6ZsQZBXGW9bg@mail.gmail.com>

Hi Alex,

Nathan has been running a workload on the 5.14 kernel + the test patch, and has
run into some interesting softlockups and hardlockups.

The first happened on a secondary server running a Windows VM, with 7 (of 10)
1080TI GPUs passed through.

Full dmesg:
https://paste.ubuntu.com/p/Wx5hCBBXKb/

There aren't any "irq x: nobody cared" messages, and the crashkernel gets stuck
in the usual place, copying IR tables from dmar, which suggests an ongoing
interrupt storm.

Nathan disabled the "kernel.hardlockup_panic = 1" sysctl, and managed to
reproduce the issue again, suggesting that we get stuck in kernel space for too
long without the ability for interrupts to be serviced.

It starts with the NIC hitting a tx queue timeout, and then an NMI is sent to
unwind the stack of each CPU, although the stacks don't appear to indicate where
things are stuck. The server then remains soft locked up, and keeps unwinding
stacks every 26 seconds or so, until it eventually hard locks up.

Full dmesg:
https://people.canonical.com/~mruffell/sf314568/1080TI_hardlockup.txt

The next interesting thing to report is when Nathan started the same Windows VM
on the primary host we have been debugging on, with the 8x 2080TI GPUs. Nathan
experienced a stuck VM, with the host responding just fine. When Nathan reset
the VM, he got 4x "irq xx: nobody cared" messages on IRQs 25, 27, 29 and 31,
which at the time corresponded to the PEX 8747 upstream PCI switches.

Interestingly, Nathan also observed 2x GPU Audio devices sharing the same IRQ
line as the upstream PCI switch, although Nathan mentioned this only occurred
very briefly, and the GPU audio devices were re-assigned different IRQs shortly
afterward.

Full dmesg:
https://paste.ubuntu.com/p/C2V4CY3yjZ/

Output showing upstream ports belonging to those IRQs:
https://paste.ubuntu.com/p/6fkSbyFNWT/

Full lspci:
https://paste.ubuntu.com/p/CTX5kbjpRP/
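
Since the IRQ sharing above was only visible very briefly, a small script that
can be polled may help catch it. The following is a minimal sketch, not
something that was run as part of this report. It assumes the per-device "irq"
attribute under /sys/bus/pci/devices reflects the IRQ currently assigned to the
device, and the function name find_shared_irqs is made up for illustration.

```python
import os
from collections import defaultdict


def find_shared_irqs(sysfs_pci_dir="/sys/bus/pci/devices"):
    """Return {irq: [pci addresses]} for IRQs reported by more than one device.

    Hypothetical helper: reads <sysfs_pci_dir>/<address>/irq for each device
    and groups the device addresses by the number found there.
    """
    by_irq = defaultdict(list)
    for bdf in sorted(os.listdir(sysfs_pci_dir)):
        irq_path = os.path.join(sysfs_pci_dir, bdf, "irq")
        try:
            with open(irq_path) as f:
                irq = int(f.read().strip())
        except (OSError, ValueError):
            # Attribute missing or unreadable: skip this device.
            continue
        if irq == 0:
            # 0 is treated here as "no IRQ assigned".
            continue
        by_irq[irq].append(bdf)
    # Keep only IRQ numbers that more than one device reports.
    return {irq: devs for irq, devs in by_irq.items() if len(devs) > 1}
```

Calling find_shared_irqs() in a loop on the host and printing any non-empty
result would show which devices report the same IRQ number at that moment.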

Let us know if you would like any additional debug information. As always, we
are happy to test patches out.

Thanks,
Matthew
