linux-pci.vger.kernel.org archive mirror
* pci_bus_read_config constantly took 1.3 seconds
@ 2019-11-27  0:46 Kexin Chen
  2019-11-27 23:17 ` Keith Busch
  0 siblings, 1 reply; 2+ messages in thread
From: Kexin Chen @ 2019-11-27  0:46 UTC
  To: linux-pci

Hi,

I'm Kexin, and I'm working on a Linux NVMe system. Some of my tests
triggered PCI AER uncorrectable errors, after which every
pci_bus_read_config_XXX access took 1.3 seconds. This caused a lot of
CPU scheduling issues, for example 'Thread not rescheduled for xxx ms
after irq xxx' or 'Softirq x took xxx ms', and finally a kernel reboot
due to a soft lockup. There is definitely a hardware issue, but could
the kernel take some action to avoid crashing and handle this
gracefully? My current system is running 4.4.182.
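
To put a number on the latency described above, config reads can also be timed from userspace through the sysfs `config` file of a device. This is only a minimal sketch: the helper names, the 0.1 second threshold, and the example device address are assumptions for illustration, and the sysfs path goes through the kernel's user-facing config accessors rather than calling pci_bus_read_config_XXX directly from the same context as the failing code.

```python
import os
import time


def time_config_read(config_path, offset=0, length=4):
    """Read `length` bytes at `offset` and return (data, elapsed_seconds)."""
    fd = os.open(config_path, os.O_RDONLY)
    try:
        start = time.monotonic()
        data = os.pread(fd, length, offset)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return data, elapsed


def slow_reads(config_path, samples=5, threshold_s=0.1):
    """Return the elapsed times of the reads slower than `threshold_s`."""
    timings = [time_config_read(config_path)[1] for _ in range(samples)]
    return [t for t in timings if t > threshold_s]
```

For example, `slow_reads("/sys/bus/pci/devices/0000:01:00.0/config")` (the address is hypothetical) would return a non-empty list if reads of the first dword are taking longer than the threshold.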

Thanks,
Kexin


* Re: pci_bus_read_config constantly took 1.3 seconds
  2019-11-27  0:46 pci_bus_read_config constantly took 1.3 seconds Kexin Chen
@ 2019-11-27 23:17 ` Keith Busch
  0 siblings, 0 replies; 2+ messages in thread
From: Keith Busch @ 2019-11-27 23:17 UTC
  To: Kexin Chen; +Cc: linux-pci

On Tue, Nov 26, 2019 at 04:46:25PM -0800, Kexin Chen wrote:
> I'm Kexin, and I'm working on a Linux NVMe system. Some of my tests
> triggered PCI AER uncorrectable errors, after which every
> pci_bus_read_config_XXX access took 1.3 seconds. This caused a lot of
> CPU scheduling issues, for example 'Thread not rescheduled for xxx ms
> after irq xxx' or 'Softirq x took xxx ms', and finally a kernel reboot
> due to a soft lockup. There is definitely a hardware issue, but could
> the kernel take some action to avoid crashing and handle this
> gracefully? My current system is running 4.4.182.

Unless the PCI layer is reading some config space that it should know
not to access, there isn't anything the kernel can do here if we're
actually waiting on hardware to complete the transaction. The hardware
just has to function correctly.

There are some types of AER errors that do indicate the kernel may avoid
accessing some config space, and that handling has been improved since
4.4. For example, we don't try reading upstream ports that are the source
of an ERR_FATAL because the link can't be considered reliable. You may
want to try a more recent stable kernel to see if any of those
improvements apply to your case.
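
The same idea, not touching a device whose link has reported a fatal error, can be mirrored in a userspace diagnostic script. The sketch below is not the kernel's own logic. It assumes a kernel new enough to expose per-device AER statistics in sysfs as `aer_dev_fatal` (this is not expected on the 4.4 kernel from this thread), and the function names are made up for illustration.

```python
import os


def parse_aer_counters(text):
    """Parse 'NAME COUNT' lines from an AER stats file into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def should_skip_config_access(fatal_text):
    """True if any fatal error has been counted for the device."""
    return parse_aer_counters(fatal_text).get("TOTAL_ERR_FATAL", 0) > 0


def read_fatal_stats(bdf):
    """Return the aer_dev_fatal contents for `bdf`, or None if absent."""
    path = os.path.join("/sys/bus/pci/devices", bdf, "aer_dev_fatal")
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return None
```

A script could call `read_fatal_stats()` on a port before poking at its config space and back off when `should_skip_config_access()` returns true, treating a missing file as "no information" rather than as "healthy".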
