regressions.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
@ 2021-11-19 10:58 Niklas Schnelle
  2021-11-20 16:38 ` Moshe Shemesh
  0 siblings, 1 reply; 8+ messages in thread
From: Niklas Schnelle @ 2021-11-19 10:58 UTC (permalink / raw)
  To: Amir Tzin, Moshe Shemesh, Saeed Mahameed; +Cc: netdev, regressions, linux-s390

Hello Amir, Moshe, and Saeed,

(resent due to wrong netdev mailing list address, sorry about that)

During testing of PCI device recovery, I found a problem in the mlx5
recovery support introduced in v5.16-rc1 by commit 32def4120e48
("net/mlx5: Read timeout values from DTOR"). My analysis of the
problem follows.

When the device is in an error state, at least on s390 but I believe
also on other systems, it is isolated and all PCI MMIO reads return
0xff. This is detected by your driver, which will immediately attempt
to recover the device with an mlx5_core driver-specific recovery
mechanism. Since at this point no reset has been done that would take
the device out of isolation, this will of course fail, as the driver
can't communicate with the device. Under normal circumstances this
reset would happen later during the new recovery flow introduced in
4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
firmware has done its side of the recovery, allowing that flow to
succeed after the driver-specific recovery has failed.

With v5.16-rc1, however, the driver-specific recovery gets stuck
holding locks, which blocks our recovery. Even without our recovery
mechanism this can be seen with
"echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
device lock forever.

Digging into this, I tracked the problem down to
mlx5_health_wait_pci_up() hanging. I added a debug print to it, and it
turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
returns 7740398493674204011 (0x6B...6B) milliseconds, so we try to
wait 245 million years. After reverting that commit things work again,
though of course the driver-specific recovery flow will still fail
before ours can kick in and finally succeed.

Thanks,
Niklas Schnelle

#regzbot introduced: 32def4120e48


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
  2021-11-19 10:58 Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years Niklas Schnelle
@ 2021-11-20 16:38 ` Moshe Shemesh
  2021-12-02  6:52   ` Thorsten Leemhuis
  0 siblings, 1 reply; 8+ messages in thread
From: Moshe Shemesh @ 2021-11-20 16:38 UTC (permalink / raw)
  To: Niklas Schnelle, Amir Tzin, Saeed Mahameed
  Cc: netdev, regressions, linux-s390

Thank you for reporting this, Niklas.

This is actually a case of use-after-free: following that patch, the
recovery flow goes through mlx5_tout_cleanup() while the timeouts
structure is still needed in this flow.

We know the root cause and will send a fix.


Thanks,

Moshe.

On 11/19/2021 12:58 PM, Niklas Schnelle wrote:
> Hello Amir, Moshe, and Saeed,
>
> (resent due to wrong netdev mailing list address, sorry about that)
>
> During testing of PCI device recovery, I found a problem in the mlx5
> recovery support introduced in v5.16-rc1 by commit 32def4120e48
> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
> problem follows.
>
> When the device is in an error state, at least on s390 but I believe
> also on other systems, it is isolated and all PCI MMIO reads return
> 0xff. This is detected by your driver, which will immediately attempt
> to recover the device with an mlx5_core driver-specific recovery
> mechanism. Since at this point no reset has been done that would take
> the device out of isolation, this will of course fail, as the driver
> can't communicate with the device. Under normal circumstances this
> reset would happen later during the new recovery flow introduced in
> 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
> firmware has done its side of the recovery, allowing that flow to
> succeed after the driver-specific recovery has failed.
>
> With v5.16-rc1, however, the driver-specific recovery gets stuck
> holding locks, which blocks our recovery. Even without our recovery
> mechanism this can be seen with
> "echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
> device lock forever.
>
> Digging into this, I tracked the problem down to
> mlx5_health_wait_pci_up() hanging. I added a debug print to it, and it
> turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
> returns 7740398493674204011 (0x6B...6B) milliseconds, so we try to
> wait 245 million years. After reverting that commit things work again,
> though of course the driver-specific recovery flow will still fail
> before ours can kick in and finally succeed.
>
> Thanks,
> Niklas Schnelle
>
> #regzbot introduced: 32def4120e48
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
  2021-11-20 16:38 ` Moshe Shemesh
@ 2021-12-02  6:52   ` Thorsten Leemhuis
  2021-12-02 10:05     ` Moshe Shemesh
  0 siblings, 1 reply; 8+ messages in thread
From: Thorsten Leemhuis @ 2021-12-02  6:52 UTC (permalink / raw)
  To: Moshe Shemesh, Niklas Schnelle, Amir Tzin, Saeed Mahameed
  Cc: netdev, regressions, linux-s390

Hi, this is your Linux kernel regression tracker speaking.

On 20.11.21 17:38, Moshe Shemesh wrote:
> Thank you for reporting this, Niklas.
>
> This is actually a case of use-after-free: following that patch, the
> recovery flow goes through mlx5_tout_cleanup() while the timeouts
> structure is still needed in this flow.
>
> We know the root cause and will send a fix.

That was twelve days ago, so allow me to ask: has any progress been
made? I could not find any with a quick search on lore.

Ciao, Thorsten

> On 11/19/2021 12:58 PM, Niklas Schnelle wrote:
>> Hello Amir, Moshe, and Saeed,
>>
>> (resent due to wrong netdev mailing list address, sorry about that)
>>
>> During testing of PCI device recovery, I found a problem in the mlx5
>> recovery support introduced in v5.16-rc1 by commit 32def4120e48
>> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
>> problem follows.
>>
>> When the device is in an error state, at least on s390 but I believe
>> also on other systems, it is isolated and all PCI MMIO reads return
>> 0xff. This is detected by your driver, which will immediately attempt
>> to recover the device with an mlx5_core driver-specific recovery
>> mechanism. Since at this point no reset has been done that would take
>> the device out of isolation, this will of course fail, as the driver
>> can't communicate with the device. Under normal circumstances this
>> reset would happen later during the new recovery flow introduced in
>> 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
>> firmware has done its side of the recovery, allowing that flow to
>> succeed after the driver-specific recovery has failed.
>>
>> With v5.16-rc1, however, the driver-specific recovery gets stuck
>> holding locks, which blocks our recovery. Even without our recovery
>> mechanism this can be seen with
>> "echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
>> device lock forever.
>>
>> Digging into this, I tracked the problem down to
>> mlx5_health_wait_pci_up() hanging. I added a debug print to it, and it
>> turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
>> returns 7740398493674204011 (0x6B...6B) milliseconds, so we try to
>> wait 245 million years. After reverting that commit things work again,
>> though of course the driver-specific recovery flow will still fail
>> before ours can kick in and finally succeed.
>>
>> Thanks,
>> Niklas Schnelle
>>
>> #regzbot introduced: 32def4120e48
>>
> 
> 

P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
on my table. I can only look briefly into most of them. Unfortunately,
I therefore sometimes get things wrong or miss something important.
I hope that's not the case here; if you think it is, don't hesitate to
tell me about it in a public reply. That's in everyone's interest, as
what I wrote above might be misleading to everyone reading this; any
suggestion I gave might thus send someone reading this down the
wrong rabbit hole, which none of us wants.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed on
all further activities wrt this regression.

#regzbot poke

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
  2021-12-02  6:52   ` Thorsten Leemhuis
@ 2021-12-02 10:05     ` Moshe Shemesh
  2021-12-02 13:56       ` Thorsten Leemhuis
  0 siblings, 1 reply; 8+ messages in thread
From: Moshe Shemesh @ 2021-12-02 10:05 UTC (permalink / raw)
  To: Thorsten Leemhuis, Niklas Schnelle, Amir Tzin, Saeed Mahameed
  Cc: netdev, regressions, linux-s390


On 12/2/2021 8:52 AM, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
>
> On 20.11.21 17:38, Moshe Shemesh wrote:
>> Thank you for reporting this, Niklas.
>>
>> This is actually a case of use-after-free: following that patch, the
>> recovery flow goes through mlx5_tout_cleanup() while the timeouts
>> structure is still needed in this flow.
>>
>> We know the root cause and will send a fix.
> That was twelve days ago, so allow me to ask: has any progress been
> made? I could not find any with a quick search on lore.


Yes, a fix was submitted by Saeed yesterday, titled "[net 10/13] net/mlx5:
Fix use after free in mlx5_health_wait_pci_up".

> Ciao, Thorsten
>
>> On 11/19/2021 12:58 PM, Niklas Schnelle wrote:
>>> Hello Amir, Moshe, and Saeed,
>>>
>>> (resent due to wrong netdev mailing list address, sorry about that)
>>>
>>> During testing of PCI device recovery, I found a problem in the mlx5
>>> recovery support introduced in v5.16-rc1 by commit 32def4120e48
>>> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
>>> problem follows.
>>>
>>> When the device is in an error state, at least on s390 but I believe
>>> also on other systems, it is isolated and all PCI MMIO reads return
>>> 0xff. This is detected by your driver, which will immediately attempt
>>> to recover the device with an mlx5_core driver-specific recovery
>>> mechanism. Since at this point no reset has been done that would take
>>> the device out of isolation, this will of course fail, as the driver
>>> can't communicate with the device. Under normal circumstances this
>>> reset would happen later during the new recovery flow introduced in
>>> 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
>>> firmware has done its side of the recovery, allowing that flow to
>>> succeed after the driver-specific recovery has failed.
>>>
>>> With v5.16-rc1, however, the driver-specific recovery gets stuck
>>> holding locks, which blocks our recovery. Even without our recovery
>>> mechanism this can be seen with
>>> "echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
>>> device lock forever.
>>>
>>> Digging into this, I tracked the problem down to
>>> mlx5_health_wait_pci_up() hanging. I added a debug print to it, and it
>>> turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
>>> returns 7740398493674204011 (0x6B...6B) milliseconds, so we try to
>>> wait 245 million years. After reverting that commit things work again,
>>> though of course the driver-specific recovery flow will still fail
>>> before ours can kick in and finally succeed.
>>>
>>> Thanks,
>>> Niklas Schnelle
>>>
>>> #regzbot introduced: 32def4120e48
>>>
>>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table. I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply. That's in everyone's interest, as
> what I wrote above might be misleading to everyone reading this; any
> suggestion I gave might thus send someone reading this down the
> wrong rabbit hole, which none of us wants.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CCed on
> all further activities wrt this regression.
>
> #regzbot poke

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
  2021-12-02 10:05     ` Moshe Shemesh
@ 2021-12-02 13:56       ` Thorsten Leemhuis
  0 siblings, 0 replies; 8+ messages in thread
From: Thorsten Leemhuis @ 2021-12-02 13:56 UTC (permalink / raw)
  To: Moshe Shemesh, Niklas Schnelle, Amir Tzin, Saeed Mahameed
  Cc: netdev, regressions, linux-s390

On 02.12.21 11:05, Moshe Shemesh wrote:
> On 12/2/2021 8:52 AM, Thorsten Leemhuis wrote:
>> On 20.11.21 17:38, Moshe Shemesh wrote:
>>> Thank you for reporting this, Niklas.
>>>
>>> This is actually a case of use-after-free: following that patch, the
>>> recovery flow goes through mlx5_tout_cleanup() while the timeouts
>>> structure is still needed in this flow.
>>>
>>> We know the root cause and will send a fix.
>> That was twelve days ago, so allow me to ask: has any progress been
>> made? I could not find any with a quick search on lore.
> 
> Yes, a fix was submitted by Saeed yesterday, titled "[net 10/13] net/mlx5:
> Fix use after free in mlx5_health_wait_pci_up".

Ahh, thx.

FWIW: it would have been nice if the fix had linked to the mail with
the regression report, for reasons explained in
Documentation/process/submitting-patches.rst. To quote:

```
If related discussions or any other background information behind the
change can be found on the web, add 'Link:' tags pointing to it. In case
your patch fixes a bug, for example, add a tag with a URL referencing
the report in the mailing list archives or a bug tracker;
```

This concept is old, but the text was reworked recently to make this use
case for the Link: tag clearer. For details see:
https://git.kernel.org/linus/1f57bd42b77c
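For this regression, such tags on the fix could have looked like the
following (the Link: target below is the lore message-id of the report
that started this thread):

```
Fixes: 32def4120e48 ("net/mlx5: Read timeout values from DTOR")
Link: https://lore.kernel.org/regressions/15db9c1d11d32fb16269afceb527b5d743177ac4.camel@linux.ibm.com
```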

Yes, that "Link:" tag is not really crucial; but it's good to have if
someone needs to look into the backstory of this change sometime in the
future. But I care for a different reason: I'm tracking this regression
(and others) with regzbot, my Linux kernel regression tracking bot. The
bot will notice if a patch with a Link: tag to a tracked regression gets
posted and record that, which allows anyone looking into the regression
to quickly grasp the current status from regzbot's webui
(https://linux-regtracking.leemhuis.info/regzbot) or its reports. The
bot will also notice if a commit with a Link: tag to a regression report
is applied by Linus and will then automatically mark the regression as
resolved.

Whatever, too late now, but maybe next time :-D I'll just tell regzbot
manually that a fix is heading towards mainline:

#regzbot monitor:
https://lore.kernel.org/r/20211201063709.229103-11-saeed@kernel.org/
#regzbot fixed-by: 76091b0fb60970f610b7ba2d886cd7fb95c5eb2e
#regzbot ignore-activity

Ciao, Thorsten

>> Ciao, Thorsten
>>
>>> On 11/19/2021 12:58 PM, Niklas Schnelle wrote:
>>>> Hello Amir, Moshe, and Saeed,
>>>>
>>>> (resent due to wrong netdev mailing list address, sorry about that)
>>>>
>>>> During testing of PCI device recovery, I found a problem in the mlx5
>>>> recovery support introduced in v5.16-rc1 by commit 32def4120e48
>>>> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
>>>> problem follows.
>>>>
>>>> When the device is in an error state, at least on s390 but I believe
>>>> also on other systems, it is isolated and all PCI MMIO reads return
>>>> 0xff. This is detected by your driver, which will immediately attempt
>>>> to recover the device with an mlx5_core driver-specific recovery
>>>> mechanism. Since at this point no reset has been done that would take
>>>> the device out of isolation, this will of course fail, as the driver
>>>> can't communicate with the device. Under normal circumstances this
>>>> reset would happen later during the new recovery flow introduced in
>>>> 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
>>>> firmware has done its side of the recovery, allowing that flow to
>>>> succeed after the driver-specific recovery has failed.
>>>>
>>>> With v5.16-rc1, however, the driver-specific recovery gets stuck
>>>> holding locks, which blocks our recovery. Even without our recovery
>>>> mechanism this can be seen with
>>>> "echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
>>>> device lock forever.
>>>>
>>>> Digging into this, I tracked the problem down to
>>>> mlx5_health_wait_pci_up() hanging. I added a debug print to it, and it
>>>> turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
>>>> returns 7740398493674204011 (0x6B...6B) milliseconds, so we try to
>>>> wait 245 million years. After reverting that commit things work again,
>>>> though of course the driver-specific recovery flow will still fail
>>>> before ours can kick in and finally succeed.
>>>>
>>>> Thanks,
>>>> Niklas Schnelle
>>>>
>>>> #regzbot introduced: 32def4120e48
>>>>
>>>
>> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
>> on my table. I can only look briefly into most of them. Unfortunately
>> therefore I sometimes will get things wrong or miss something important.
>> I hope that's not the case here; if you think it is, don't hesitate to
>> tell me about it in a public reply. That's in everyone's interest, as
>> what I wrote above might be misleading to everyone reading this; any
>> suggestion I gave might thus send someone reading this down the
>> wrong rabbit hole, which none of us wants.
>>
>> BTW, I have no personal interest in this issue, which is tracked using
>> regzbot, my Linux kernel regression tracking bot
>> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
>> this mail to get things rolling again and hence don't need to be CCed on
>> all further activities wrt this regression.
>>
>> #regzbot poke
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
  2021-11-19 11:38 ` Thorsten Leemhuis
@ 2021-11-19 12:17   ` Thorsten Leemhuis
  0 siblings, 0 replies; 8+ messages in thread
From: Thorsten Leemhuis @ 2021-11-19 12:17 UTC (permalink / raw)
  To: regressions

On 19.11.21 12:38, Thorsten Leemhuis wrote:
> 
> 
> 
> On 19.11.21 11:47, Niklas Schnelle wrote:
>> Hello Amir, Moshe, and Saeed,
>>
>> During testing of PCI device recovery, I found a problem in the mlx5
>> recovery support introduced in v5.16-rc1 by commit 32def4120e48
>> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
>> problem follows.
>>
>> [...]
>>
>> Thanks,
>> Niklas Schnelle
>>
>> #regzbot introduced: 32def4120e48
> 
> Many thx for getting regzbot directly involved. Due to the repost it now
> has two entries for this regression, thus I'm marking this one as a
> duplicate of the repost:
> 
> #regzbot dup-of:
> https://lore.kernel.org/regressions/a627b3cbd2ed4ba5dd75e1163760730187f1416b.camel@linux.ibm.com/

Brown paperbag mail:

Ugh, that command told regzbot to mark the regression as a duplicate of
itself, which obviously is stupid. Trying it again:

#regzbot dup-of:
https://lore.kernel.org/regressions/15db9c1d11d32fb16269afceb527b5d743177ac4.camel@linux.ibm.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
  2021-11-19 10:47 Niklas Schnelle
@ 2021-11-19 11:38 ` Thorsten Leemhuis
  2021-11-19 12:17   ` Thorsten Leemhuis
  0 siblings, 1 reply; 8+ messages in thread
From: Thorsten Leemhuis @ 2021-11-19 11:38 UTC (permalink / raw)
  To: Niklas Schnelle, regressions




On 19.11.21 11:47, Niklas Schnelle wrote:
> Hello Amir, Moshe, and Saeed,
> 
> During testing of PCI device recovery, I found a problem in the mlx5
> recovery support introduced in v5.16-rc1 by commit 32def4120e48
> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
> problem follows.
>
> [...]
>
> Thanks,
> Niklas Schnelle
> 
> #regzbot introduced: 32def4120e48

Many thx for getting regzbot directly involved. Due to the repost it now
has two entries for this regression, thus I'm marking this one as a
duplicate of the repost:

#regzbot dup-of:
https://lore.kernel.org/regressions/a627b3cbd2ed4ba5dd75e1163760730187f1416b.camel@linux.ibm.com/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
@ 2021-11-19 10:47 Niklas Schnelle
  2021-11-19 11:38 ` Thorsten Leemhuis
  0 siblings, 1 reply; 8+ messages in thread
From: Niklas Schnelle @ 2021-11-19 10:47 UTC (permalink / raw)
  To: Amir Tzin, Moshe Shemesh, Saeed Mahameed
  Cc: linux-netdev, regressions, linux-s390

Hello Amir, Moshe, and Saeed,

During testing of PCI device recovery, I found a problem in the mlx5
recovery support introduced in v5.16-rc1 by commit 32def4120e48
("net/mlx5: Read timeout values from DTOR"). My analysis of the
problem follows.

When the device is in an error state, at least on s390 but I believe
also on other systems, it is isolated and all PCI MMIO reads return
0xff. This is detected by your driver, which will immediately attempt
to recover the device with an mlx5_core driver-specific recovery
mechanism. Since at this point no reset has been done that would take
the device out of isolation, this will of course fail, as the driver
can't communicate with the device. Under normal circumstances this
reset would happen later during the new recovery flow introduced in
4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
firmware has done its side of the recovery, allowing that flow to
succeed after the driver-specific recovery has failed.

With v5.16-rc1, however, the driver-specific recovery gets stuck
holding locks, which blocks our recovery. Even without our recovery
mechanism this can be seen with
"echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
device lock forever.

Digging into this, I tracked the problem down to
mlx5_health_wait_pci_up() hanging. I added a debug print to it, and it
turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
returns 7740398493674204011 (0x6B...6B) milliseconds, so we try to
wait 245 million years. After reverting that commit things work again,
though of course the driver-specific recovery flow will still fail
before ours can kick in and finally succeed.

Thanks,
Niklas Schnelle

#regzbot introduced: 32def4120e48



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-12-02 13:57 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-19 10:58 Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years Niklas Schnelle
2021-11-20 16:38 ` Moshe Shemesh
2021-12-02  6:52   ` Thorsten Leemhuis
2021-12-02 10:05     ` Moshe Shemesh
2021-12-02 13:56       ` Thorsten Leemhuis
  -- strict thread matches above, loose matches on Subject: below --
2021-11-19 10:47 Niklas Schnelle
2021-11-19 11:38 ` Thorsten Leemhuis
2021-11-19 12:17   ` Thorsten Leemhuis

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).