From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.6 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,NICE_REPLY_A, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8E081C43457 for ; Mon, 12 Oct 2020 14:50:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 52E7120870 for ; Mon, 12 Oct 2020 14:50:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1602514251; bh=Um3brhR4V7mbgCo0NMThdWSXjyTwvJ/bNPfg+5StKaM=; h=Subject:To:Cc:References:From:Date:In-Reply-To:List-ID:From; b=AOAfitZ7h4hUz2yVp69tX8Kjytp4+FKY9crUsEJmF2/cRrLp0LTpg/8oAi8JhbGbq Ae/YRhWzYr5X9avO0JiX62Pywo2+YQ8hI+99mlzdc/VjnM4MivFC5rJ8mYYre2c8+v XbmelCwYW2BvuBLhPj/9SIdo/GV+rjoD68QT9NyI= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389189AbgJLOuc (ORCPT ); Mon, 12 Oct 2020 10:50:32 -0400 Received: from mail.kernel.org ([198.145.29.99]:57014 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387930AbgJLOub (ORCPT ); Mon, 12 Oct 2020 10:50:31 -0400 Received: from [192.168.0.112] (75-58-59-55.lightspeed.rlghnc.sbcglobal.net [75.58.59.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 37B142080A; Mon, 12 Oct 2020 14:50:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1602514230; bh=Um3brhR4V7mbgCo0NMThdWSXjyTwvJ/bNPfg+5StKaM=; h=Subject:To:Cc:References:From:Date:In-Reply-To:From; b=LJqc3NPzieo+zCMFKTWJbZTtyeoHuwpGsCfHSBwse3DTJOJaJoamovVG0tKre3pIP kP3vaN9XQyoxFskgTFvlBth9KjiJftUWI6h5Y6yOUjlPz5wzlrnUxtMt4Ayi4aIHuE 3R4tIBD50qhLdnfnA5/+1oYfSdQ37BN3Tka2EsjU= Subject: Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling To: sathyanarayanan.nkuppuswamy@gmail.com, bhelgaas@google.com Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, ashok.raj@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com References: <5c5bca0bdb958e456176fe6ede10ba8f838fbafc.1602263264.git.sathyanarayanan.kuppuswamy@linux.intel.com> From: Sinan Kaya Autocrypt: addr=okaya@kernel.org; keydata= mQENBFrnOrUBCADGOL0kF21B6ogpOkuYvz6bUjO7NU99PKhXx1MfK/AzK+SFgxJF7dMluoF6 uT47bU7zb7HqACH6itTgSSiJeSoq86jYoq5s4JOyaj0/18Hf3/YBah7AOuwk6LtV3EftQIhw 9vXqCnBwP/nID6PQ685zl3vH68yzF6FVNwbDagxUz/gMiQh7scHvVCjiqkJ+qu/36JgtTYYw 8lGWRcto6gr0eTF8Wd8f81wspmUHGsFdN/xPsZPKMw6/on9oOj3AidcR3P9EdLY4qQyjvcNC V9cL9b5I/Ud9ghPwW4QkM7uhYqQDyh3SwgEFudc+/RsDuxjVlg9CFnGhS0nPXR89SaQZABEB AAG0HVNpbmFuIEtheWEgPG9rYXlhQGtlcm5lbC5vcmc+iQFOBBMBCAA4FiEEYdOlMSE+a7/c ckrQvGF4I+4LAFcFAlztcAoCGwMFCwkIBwIGFQoJCAsCBBYCAwECHgECF4AACgkQvGF4I+4L AFfidAf/VKHInxep0Z96iYkIq42432HTZUrxNzG9IWk4HN7c3vTJKv2W+b9pgvBF1SmkyQSy 8SJ3Zd98CO6FOHA1FigFyZahVsme+T0GsS3/OF1kjrtMktoREr8t0rK0yKpCTYVdlkHadxmR Qs5xLzW1RqKlrNigKHI2yhgpMwrpzS+67F1biT41227sqFzW9urEl/jqGJXaB6GV+SRKSHN+ ubWXgE1NkmfAMeyJPKojNT7ReL6eh3BNB/Xh1vQJew+AE50EP7o36UXghoUktnx6cTkge0ZS qgxuhN33cCOU36pWQhPqVSlLTZQJVxuCmlaHbYWvye7bBOhmiuNKhOzb3FcgT7kBDQRa5zq1 AQgAyRq/7JZKOyB8wRx6fHE0nb31P75kCnL3oE+smKW/sOcIQDV3C7mZKLf472MWB1xdr4Tm eXeL/wT0QHapLn5M5wWghC80YvjjdolHnlq9QlYVtvl1ocAC28y43tKJfklhHiwMNDJfdZbw 9lQ2h+7nccFWASNUu9cqZOABLvJcgLnfdDpnSzOye09VVlKr3NHgRyRZa7me/oFJCxrJlKAl 2hllRLt0yV08o7i14+qmvxI2EKLX9zJfJ2rGWLTVe3EJBnCsQPDzAUVYSnTtqELu2AGzvDiM gatRaosnzhvvEK+kCuXuCuZlRWP7pWSHqFFuYq596RRG5hNGLbmVFZrCxQARAQABiQEfBBgB CAAJBQJa5zq1AhsMAAoJELxheCPuCwBX2UYH/2kkMC4mImvoClrmcMsNGijcZHdDlz8NFfCI gSb3NHkarnA7uAg8KJuaHUwBMk3kBhv2BGPLcmAknzBIehbZ284W7u3DT9o1Y5g+LDyx8RIi e7pnMcC+bE2IJExCVf2p3PB1tDBBdLEYJoyFz/XpdDjZ8aVls/pIyrq+mqo5LuuhWfZzPPec 9EiM2eXpJw+Rz+vKjSt1YIhg46YbdZrDM2FGrt9ve3YaM5H0lzJgq/JQPKFdbd5MB0X37Qc+ 2m/A9u9SFnOovA42DgXUyC2cSbIJdPWOK9PnzfXqF3sX9Aol2eLUmQuLpThJtq5EHu6FzJ7Y L+s0nPaNMKwv/Xhhm6Y= Message-ID: <5ae14b67-94a5-6d2f-b74d-ca32bbd079cd@kernel.org> Date: Mon, 12 Oct 2020 10:50:29 -0400 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.12.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org On 10/12/2020 1:03 AM, sathyanarayanan.nkuppuswamy@gmail.com wrote: > From: Kuppuswamy Sathyanarayanan > > Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > merged fatal and non-fatal error recovery paths, and also made > recovery code depend on hotplug handler for "remove affected > device + rescan" support. But this change also complicated the > error recovery path and which in turn led to the following > issues. > > 1. We depend on hotplug handler for removing the affected > devices/drivers on DLLSC LINK down event (on DPC event > trigger) and DPC handler for handling the error recovery. Since > both handlers operate on same set of affected devices, it leads > to race condition, which in turn leads to  NULL pointer > exceptions or error recovery failures.You can find more details > about this issue in following link. > > https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@intel.com/T/#t > > 2. For non-hotplug capable devices fatal (DPC) error recovery > is currently broken. Current fatal error recovery implementation > relies on PCIe hotplug (pciehp) handler for detaching and > re-enumerating the affected devices/drivers. So when dealing with > non-hotplug capable devices, recovery code does not restore the state > of the affected devices correctly. You can find more details about > this issue in the following links. > > https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/ > https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/ > > In order to fix the above two issues, we should stop relying on hotplug > handler for cleaning the affected devices/drivers and let error recovery > handler own this functionality. So this patch reverts Commit bdb5ac85777d > ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove > affected device + rescan"  functionality in fatal error recovery handler. > > Also holding pci_lock_rescan_remove() will prevent the race between hotplug > and DPC handler. > > Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > Signed-off-by: Kuppuswamy Sathyanarayanan > --- > Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------ > drivers/pci/pcie/err.c | 71 +++++++++++++++++++----- > 2 files changed, 87 insertions(+), 31 deletions(-) I'm not sure about locks involved but I do like the concept. Current fatal error handling is best effort. There is no way to recover if link is down by the time we reach to fatal error handling routine. This change will make the solution more reliable. Reviewed-by: Sinan Kaya