From: Niklas Schnelle <firstname.lastname@example.org>
To: Amir Tzin <email@example.com>, Moshe Shemesh <firstname.lastname@example.org>, Saeed Mahameed <email@example.com>
Cc: linux-netdev <firstname.lastname@example.org>, email@example.com, linux-s390 <firstname.lastname@example.org>
Subject: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
Date: Fri, 19 Nov 2021 11:47:52 +0100
Message-ID: <email@example.com>

Hello Amir, Moshe, and Saeed,

During testing of PCI device recovery, I found a problem in the mlx5 recovery support introduced in v5.16-rc1 by commit 32def4120e48 ("net/mlx5: Read timeout values from DTOR"). My analysis of the problem follows.

When the device is in an error state, at least on s390 but I believe also on other systems, it is isolated and all PCI MMIO reads return 0xff. Your driver detects this and immediately attempts to recover the device with an mlx5_core driver-specific recovery mechanism. Since at this point no reset has been done that would take the device out of isolation, this will of course fail, as the driver can't communicate with the device. Under normal circumstances this reset would happen later, during the new recovery flow introduced in 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once firmware has done its side of the recovery, allowing that flow to succeed once the driver-specific recovery has failed. With v5.16-rc1, however, the driver-specific recovery gets stuck holding locks, which blocks our recovery. Without our recovery mechanism this can also be seen with "echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the device lock forever.

Digging into this, I tracked the problem down to mlx5_health_wait_pci_up() hanging. I added a debug print to it, and it turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET) returns 7740398493674204011 (0x6B...6B) milliseconds and we try to wait 245 million years.

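As a quick sanity check of the figure in the subject line, the arithmetic can be reproduced outside the kernel. This is only a sketch, not driver code: it assumes the elided value 0x6B...6B is a 64-bit quantity with every byte set to 0x6B, and the helper names below are made up for illustration.

```python
# Sanity check of the timeout arithmetic only; this is not mlx5 code.
# Assumption: "0x6B...6B" is a 64-bit value in which every byte is 0x6B.
FILL_BYTE = 0x6B


def repeated_byte_u64(byte: int) -> int:
    """Return the 64-bit value whose eight bytes all equal `byte`."""
    return int.from_bytes(bytes([byte]) * 8, "little")


def ms_to_years(ms: int) -> float:
    """Convert milliseconds to years, using 365.25-day years."""
    ms_per_year = 1000 * 60 * 60 * 24 * 365.25
    return ms / ms_per_year


timeout_ms = repeated_byte_u64(FILL_BYTE)
print(hex(timeout_ms))                                     # the 64-bit pattern
print(timeout_ms)                                          # timeout in milliseconds
print(f"{ms_to_years(timeout_ms) / 1e6:.0f} million years")
```

Under that assumption the value works out to roughly 245 million years, which matches the wait described above.
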
After reverting that commit things work again, though of course the driver-specific recovery flow will still fail before ours can kick in and finally succeed.

Thanks,
Niklas Schnelle

#regzbot introduced: 32def4120e48

Thread overview: 8+ messages
2021-11-19 10:47 Niklas Schnelle [this message]
2021-11-19 11:38 ` Thorsten Leemhuis
2021-11-19 12:17 ` Thorsten Leemhuis
2021-11-19 10:58 Niklas Schnelle
2021-11-20 16:38 ` Moshe Shemesh
2021-12-02 6:52 ` Thorsten Leemhuis
2021-12-02 10:05 ` Moshe Shemesh
2021-12-02 13:56 ` Thorsten Leemhuis