From: Moshe Shemesh <moshe@nvidia.com>
To: Thorsten Leemhuis <regressions@leemhuis.info>,
	Niklas Schnelle <schnelle@linux.ibm.com>,
	Amir Tzin <amirtz@nvidia.com>, Saeed Mahameed <saeedm@nvidia.com>
Cc: netdev <netdev@vger.kernel.org>, <regressions@lists.linux.dev>,
	linux-s390 <linux-s390@vger.kernel.org>
Subject: Re: Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years
Date: Thu, 2 Dec 2021 12:05:42 +0200
Message-ID: <c8cf2b24-c790-fa70-c2c5-474201743b4d@nvidia.com>
In-Reply-To: <68f2163e-63a2-c6dd-1491-fd748a92ac36@leemhuis.info>


On 12/2/2021 8:52 AM, Thorsten Leemhuis wrote:
> Hi, this is your Linux kernel regression tracker speaking.
>
> On 20.11.21 17:38, Moshe Shemesh wrote:
>> Thank you for reporting, Niklas.
>>
>> This is actually a case of use after free: following that patch, the
>> recovery flow goes through mlx5_tout_cleanup() while the timeouts
>> structure is still needed in this flow.
>>
>> We know the root cause and will send a fix.
> That was twelve days ago, so allow me to ask: has any progress been
> made? I could not find any with a quick search on lore.


Yes, a fix was submitted by Saeed yesterday, titled "[net 10/13] net/mlx5:
Fix use after free in mlx5_health_wait_pci_up".

> Ciao, Thorsten
>
>> On 11/19/2021 12:58 PM, Niklas Schnelle wrote:
>>> Hello Amir, Moshe, and Saeed,
>>>
>>> (resent due to wrong netdev mailing list address, sorry about that)
>>>
>>> During testing of PCI device recovery, I found a problem in the mlx5
>>> recovery support introduced in v5.16-rc1 by commit 32def4120e48
>>> ("net/mlx5: Read timeout values from DTOR"). My analysis of the
>>> problem follows.
>>>
>>> When the device is in an error state, at least on s390 but I believe
>>> also on other systems, it is isolated and all PCI MMIO reads return
>>> 0xff. This is detected by your driver and it will immediately attempt
>>> to recover the device with an mlx5_core driver-specific recovery
>>> mechanism. Since at this point no reset has been done that would take
>>> the device out of isolation, this will of course fail as it can't
>>> communicate with the device. Under normal circumstances this reset
>>> would happen later, during the new recovery flow introduced in
>>> 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery"), once
>>> firmware has done its side of the recovery. That flow can then
>>> succeed after the driver-specific recovery has failed.
>>>
>>> With v5.16-rc1, however, the driver-specific recovery gets stuck
>>> holding locks, which blocks our recovery. Even without our recovery
>>> mechanism, this can be seen with
>>> "echo 1 > /sys/bus/pci/devices/<dev>/remove", which hangs on the
>>> device lock forever.
>>>
>>> Digging into this, I tracked the problem down to
>>> mlx5_health_wait_pci_up() hanging. I added a debug print to it and it
>>> turns out that with the device isolated mlx5_tout_ms(dev, FW_RESET)
>>> returns 7740398493674204011 (0x6B...6B) milliseconds and we try to wait
>>> 245 million years. After reverting that commit things work again,
>>> though of course the driver-specific recovery flow will still fail
>>> before ours can kick in and finally succeed.
>>>
>>> Thanks,
>>> Niklas Schnelle
>>>
>>> #regzbot introduced: 32def4120e48
>>>
>>
> P.S.: As a Linux kernel regression tracker I'm getting a lot of reports
> on my table. I can only look briefly into most of them. Unfortunately
> therefore I sometimes will get things wrong or miss something important.
> I hope that's not the case here; if you think it is, don't hesitate to
> tell me about it in a public reply. That's in everyone's interest, as
> what I wrote above might be misleading to everyone reading this; any
> suggestion I gave thus might send someone reading this down the
> wrong rabbit hole, which none of us wants.
>
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CCed on
> all further activities wrt this regression.
>
> #regzbot poke
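
As a side note on the numbers in Niklas's report: 0x6B is the byte the
kernel's slab poisoning writes into freed objects (POISON_FREE in
include/linux/poison.h), so a timeout made entirely of 0x6B bytes is
consistent with the use-after-free diagnosis, assuming slab poisoning was
active on the test system. The arithmetic behind the "245 million years"
figure can be checked with a minimal sketch like the one below. The helper
names poisoned_u64() and ms_to_years() are made up for this illustration
and are not part of the mlx5 driver.

```python
# Illustration only: read eight 0x6B bytes as a 64-bit millisecond count.
POISON_FREE = 0x6B  # byte value defined in include/linux/poison.h


def poisoned_u64(byte: int = POISON_FREE) -> int:
    """Return the 64-bit integer whose eight bytes all equal `byte`."""
    return int.from_bytes(bytes([byte]) * 8, "little")


def ms_to_years(ms: int) -> float:
    """Convert milliseconds to years of 365.25 days."""
    return ms / 1000 / 86400 / 365.25


timeout_ms = poisoned_u64()
print(hex(timeout_ms))   # the 0x6B...6B pattern from the report
print(timeout_ms)        # the same value in decimal milliseconds
print(f"{ms_to_years(timeout_ms) / 1e6:.0f} million years")
```

The decimal value is 7740398493674204011 ms, which works out to roughly
245 million years, matching the subject line.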

Thread overview: 8+ messages
2021-11-19 10:58 Regression in v5.16-rc1: Timeout in mlx5_health_wait_pci_up() may try to wait 245 million years Niklas Schnelle
2021-11-20 16:38 ` Moshe Shemesh
2021-12-02  6:52   ` Thorsten Leemhuis
2021-12-02 10:05     ` Moshe Shemesh [this message]
2021-12-02 13:56       ` Thorsten Leemhuis
  -- strict thread matches above, loose matches on Subject: below --
2021-11-19 10:47 Niklas Schnelle
2021-11-19 11:38 ` Thorsten Leemhuis
2021-11-19 12:17   ` Thorsten Leemhuis
