Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years

From: Thorsten Leemhuis <regressions@leemhuis.info>
To: Moshe Shemesh <moshe@nvidia.com>,
	Niklas Schnelle <schnelle@linux.ibm.com>,
	Amir Tzin <amirtz@nvidia.com>, Saeed Mahameed <saeedm@nvidia.com>
Cc: netdev <netdev@vger.kernel.org>,
	regressions@lists.linux.dev,
	linux-s390 <linux-s390@vger.kernel.org>
Subject: Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
Date: Thu, 2 Dec 2021 14:56:53 +0100	[thread overview]
Message-ID: <bbb0dbf7-5374-d3bb-5bcc-e69043133ed6@leemhuis.info> (raw)
In-Reply-To: <c8cf2b24-c790-fa70-c2c5-474201743b4d@nvidia.com>

On 02.12.21 11:05, Moshe Shemesh wrote:
> On 12/2/2021 8:52 AM, Thorsten Leemhuis wrote:
>> On 20.11.21 17:38, Moshe Shemesh wrote:
>>> Thank you for reporting Niklas.
>>>
>>> This is actually a case of use after free, as following that patch the
>>> recovery flow goes through mlx5_tout_cleanup() while timeouts structure
>>> is still needed in this flow.
>>>
>>> We know the root cause and will send a fix.
>> That was twelve days ago, thus allow me asking: has any progress been
>> made? I could not find any with a quick search on lore.
> 
> Yes, fix was submitted by Saeed yesterday, title: "[net 10/13] net/mlx5:
> Fix use after free in mlx5_health_wait_pci_up".

Ahh, thx.

FWIW: would have been nice if the fix would have linked to the mail
which the regression report, for reasons explained in
Documentation/process/submitting-patches.rst. To quote:

```
If related discussions or any other background information behind the
change can be found on the web, add 'Link:' tags pointing to it. In case
your patch fixes a bug, for example, add a tag with a URL referencing
the report in the mailing list archives or a bug tracker;
```

This concept is old, but the text was reworked recently to make this use
case for the Link: tag clearer. For details see:
https://git.kernel.org/linus/1f57bd42b77c

Yes, that "Link:" is not really crucial; but it's good to have if
someone needs to look into the backstory of this change sometime in the
future. But I care for a different reason. I'm tracking this regression
(and others) with regzbot, my Linux kernel regression tracking bot. This
bot will notice if a patch with a Link: tag to a tracked regression gets
posted and record that, which allowed anyone looking into the regression
to quickly gasp the current status from regzbot's webui
(https://linux-regtracking.leemhuis.info/regzbot ) or its reports. The
bot will also notice if a commit with a Link: tag to a regression report
is applied by Linus and then automatically mark the regression as
resolved then.

Whatever, too late now, but maybe next time :-D I just rell regzbot
manually that a fix is heading towards mailine:

#regzbot monitor:
https://lore.kernel.org/r/20211201063709.229103-11-saeed@kernel.org/
#regzbot fixed-by: 76091b0fb60970f610b7ba2d886cd7fb95c5eb2e
#regzbot ignore-activity

Ciao, Thorsten

>> Ciao, Thorsten
>>
>>> On 11/19/2021 12:58 PM, Niklas Schnelle wrote:
>>>> Hello Amir, Moshe, and Saeed,
>>>>
>>>> (resent due to wrong netdev mailing list address, sorry about that)
>>>>
>>>> During testing of PCI device recovery, I found a problem in the mlx5
>>>> recovery support introduced in v5.16-rc1 by commit 32def4120e48
>>>> ("net/mlx5: Read timeout values from DTOR"). It follows my analysis of
>>>> the problem.
>>>>
>>>> When the device is in an error state, at least on s390 but I believe
>>>> also on other systems, it is isolated and all PCI MMIO reads return
>>>> 0xff. This is detected by your driver and it will immediately attempt
>>>> to recovery the device with a mlx5_core driver specific recovery
>>>> mechanism. Since at this point no reset has been done that would take
>>>> the device out of isolation this will of course fail as it can't
>>>> communicate with the device. Under normal circumstances this reset
>>>> would happen later during the new recovery flow introduced in
>>>> 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery") once
>>>> firmware has done their side of the recovery allowing that to succeed
>>>> once the driver specific recovery has failed.
>>>>
>>>> With v5.16-rc1 however the driver specific recovery gets stuck holding
>>>> locks which will block our recovery. Without our recovery mechanism
>>>> this can also be seen by "echo 1 > /sys/bus/pci/devices/<dev>/remove"
>>>> which hangs on the device lock forever.
>>>>
>>>> Digging into this I tracked the problem down to
>>>> mlx5_health_wait_pci_up() hangig. I added a debug print to it and it
>>>> turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
>>>> returns 774039849367420401 (0x6B...6B) milliseconds and we try to wait
>>>> 245 million years. After reverting that commit things work again,
>>>> though of course the driver specific recovery flow will still fail
>>>> before ours can kick in and finally succeed.
>>>>
>>>> Thanks,
>>>> Niklas Schnelle
>>>>
>>>> #regzbot introduced: 32def4120e48
>>>>
>>>
>> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
>> on my table. I can only look briefly into most of them. Unfortunately
>> therefore I sometimes will get things wrong or miss something important.
>> I hope that's not the case here; if you think it is, don't hesitate to
>> tell me about it in a public reply. That's in everyone's interest, as
>> what I wrote above might be misleading to everyone reading this; any
>> suggestion I gave they thus might sent someone reading this down the
>> wrong rabbit hole, which none of us wants.
>>
>> BTW, I have no personal interest in this issue, which is tracked using
>> regzbot, my Linux kernel regression tracking bot
>> (https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flinux-regtracking.leemhuis.info%2Fregzbot%2F&amp;data=04%7C01%7Cmoshe%40nvidia.com%7C33857ebcf13946a09c6408d9b5605f19%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C637740248366231179%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&amp;sdata=Fuqme7inI68fhvGfPh2WPzvussq1awkqxFLqKHm%2FSmQ%3D&amp;reserved=0).
>> I'm only posting
>> this mail to get things rolling again and hence don't need to be CC on
>> all further activities wrt to this regression.
>>
>> #regzbot poke
>