regressions.lists.linux.dev archive mirror
* Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
@ 2021-11-19 10:58 Niklas Schnelle
  2021-11-20 16:38 ` Moshe Shemesh
  0 siblings, 1 reply; 8+ messages in thread
From: Niklas Schnelle @ 2021-11-19 10:58 UTC (permalink / raw)
  To: Amir Tzin, Moshe Shemesh, Saeed Mahameed; +Cc: netdev, regressions, linux-s390

Hello Amir, Moshe, and Saeed,

(resent due to wrong netdev mailing list address, sorry about that)

During testing of PCI device recovery, I found a problem in the mlx5
recovery support introduced in v5.16-rc1 by commit 32def4120e48
("net/mlx5: Read timeout values from DTOR"). My analysis of the problem
follows.

When the device is in an error state, at least on s390 but I believe
also on other systems, it is isolated and all PCI MMIO reads return
0xff. This is detected by your driver and it will immediately attempt
to recover the device with a mlx5_core driver specific recovery
mechanism. Since at this point no reset has been done that would take
the device out of isolation, this will of course fail as it can't
communicate with the device. Under normal circumstances this reset
happens later, during the new recovery flow introduced in
4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
firmware has done its side of the recovery. That flow can then succeed
after the driver-specific recovery has failed.
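
To illustrate the all-0xff condition described above, here is a minimal
sketch of how a read from an isolated device could be recognized. The
function name and the fixed read width are hypothetical; this is not the
mlx5_core detection logic, only an illustration of the bit pattern.

```python
def is_all_ones(value: int, width_bytes: int) -> bool:
    """Return True if a read of the given width came back as all 0xff bytes.

    Hypothetical helper: an isolated PCI function is described as
    returning 0xff for every byte of an MMIO read.
    """
    mask = (1 << (8 * width_bytes)) - 1  # e.g. 0xFFFFFFFF for 4 bytes
    return (value & mask) == mask


# A 32-bit read from an isolated device would look like this:
print(is_all_ones(0xFFFFFFFF, 4))
# A plausible register value would not:
print(is_all_ones(0x00010203, 4))
```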

With v5.16-rc1, however, the driver-specific recovery gets stuck
holding locks, which blocks our recovery. Without our recovery
mechanism this can also be seen with
"echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
device lock forever.

Digging into this I tracked the problem down to
mlx5_health_wait_pci_up() hanging. I added a debug print to it and it
turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
returns 7740398493674204011 (0x6B...6B) milliseconds and we try to wait
245 million years. After reverting that commit things work again,
though of course the driver specific recovery flow will still fail
before ours can kick in and finally succeed.
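
The arithmetic behind the "245 million years" figure can be checked with
a short sketch. It assumes that the abbreviated 0x6B...6B stands for a
64-bit value in which every byte is 0x6B, and it uses a 365.25-day year;
both are assumptions for illustration, and the helper name is
hypothetical.

```python
# Milliseconds in one year, assuming 365.25 days per year.
MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365.25


def ms_to_years(ms: int) -> float:
    """Convert a duration in milliseconds to years."""
    return ms / MS_PER_YEAR


# Assumed pattern: eight bytes, each 0x6B.
timeout_ms = int.from_bytes(bytes([0x6B]) * 8, "big")

print(hex(timeout_ms))                       # 0x6b6b6b6b6b6b6b6b
print(timeout_ms)                            # 7740398493674204011
print(round(ms_to_years(timeout_ms) / 1e6))  # 245 (million years)
```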

Thanks,
Niklas Schnelle

#regzbot introduced: 32def4120e48


* Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
@ 2021-11-19 10:47 Niklas Schnelle
  2021-11-19 11:38 ` Thorsten Leemhuis
  0 siblings, 1 reply; 8+ messages in thread
From: Niklas Schnelle @ 2021-11-19 10:47 UTC (permalink / raw)
  To: Amir Tzin, Moshe Shemesh, Saeed Mahameed
  Cc: linux-netdev, regressions, linux-s390


end of thread, other threads:[~2021-12-02 13:57 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
2021-11-19 10:58 Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years Niklas Schnelle
2021-11-20 16:38 ` Moshe Shemesh
2021-12-02  6:52   ` Thorsten Leemhuis
2021-12-02 10:05     ` Moshe Shemesh
2021-12-02 13:56       ` Thorsten Leemhuis
  -- strict thread matches above, loose matches on Subject: below --
2021-11-19 10:47 Niklas Schnelle
2021-11-19 11:38 ` Thorsten Leemhuis
2021-11-19 12:17   ` Thorsten Leemhuis
