linux-kernel.vger.kernel.org archive mirror
From: Sinan Kaya <okaya@kernel.org>
To: sathyanarayanan.nkuppuswamy@gmail.com, bhelgaas@google.com
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	ashok.raj@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com
Subject: Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
Date: Mon, 12 Oct 2020 10:50:29 -0400
Message-ID: <5ae14b67-94a5-6d2f-b74d-ca32bbd079cd@kernel.org>
In-Reply-To: <c6e3f1168d5d88b207b59c434792a10a7331bb89.1602263264.git.sathyanarayanan.kuppuswamy@linux.intel.com>

On 10/12/2020 1:03 AM, sathyanarayanan.nkuppuswamy@gmail.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path, which in turn led to the following
> issues.
> 
> 1. We depend on the hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and on the DPC handler for handling the error recovery.
> Since both handlers operate on the same set of affected devices,
> this leads to a race condition, which in turn leads to NULL
> pointer exceptions or error recovery failures. You can find more
> details about this issue in the following link.
> 
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@intel.com/T/#t
> 
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
> 
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/
> 
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduces the "remove
> affected device + rescan" functionality in the fatal error recovery handler.
> 
> Also, holding pci_lock_rescan_remove() will prevent the race between the
> hotplug and DPC handlers.
> 
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------
>  drivers/pci/pcie/err.c                   | 71 +++++++++++++++++++-----
>  2 files changed, 87 insertions(+), 31 deletions(-)

I'm not sure about the locks involved, but I do like the concept.
Current fatal error handling is best effort.

There is no way to recover if the link is down by the time we
reach the fatal error handling routine.

This change will make the solution more reliable.

Reviewed-by: Sinan Kaya <okaya@kernel.org>
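
For readers following the thread, the split described in the quoted
commit message can be pictured with a small userspace model. This is
only a sketch and is not code from the patch: every name below
(model_bus, model_recover, and so on) is hypothetical, the lock is a
plain counter standing in for pci_lock_rescan_remove() /
pci_unlock_rescan_remove(), and the ordering (remove, reset link,
rescan) is an assumption based on the "remove affected device +
rescan" wording above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 4

enum err_severity { ERR_NONFATAL, ERR_FATAL };
enum recovery_result { RECOVERED, DISCONNECT };

/* Hypothetical stand-in for a device below the affected port. */
struct model_dev {
	bool present;
	bool driver_bound;
};

/* Hypothetical stand-in for the bus below the affected port. */
struct model_bus {
	struct model_dev devs[MODEL_MAX_DEVS];
	size_t ndevs;
	bool link_up;
	int lock_depth;	/* models the rescan/remove lock, not a real lock */
};

static void model_lock(struct model_bus *bus)
{
	bus->lock_depth++;
}

static void model_unlock(struct model_bus *bus)
{
	assert(bus->lock_depth > 0);	/* unlock must pair with a lock */
	bus->lock_depth--;
}

/* Populate ndevs devices, all present and bound, with the link up. */
void model_bus_init(struct model_bus *bus, size_t ndevs)
{
	bus->ndevs = ndevs > MODEL_MAX_DEVS ? MODEL_MAX_DEVS : ndevs;
	bus->link_up = true;
	bus->lock_depth = 0;
	for (size_t i = 0; i < MODEL_MAX_DEVS; i++) {
		bus->devs[i].present = i < bus->ndevs;
		bus->devs[i].driver_bound = i < bus->ndevs;
	}
}

static void model_remove_all(struct model_bus *bus)
{
	for (size_t i = 0; i < bus->ndevs; i++) {
		bus->devs[i].driver_bound = false;
		bus->devs[i].present = false;
	}
}

static void model_rescan(struct model_bus *bus)
{
	for (size_t i = 0; i < bus->ndevs; i++) {
		bus->devs[i].present = true;
		bus->devs[i].driver_bound = true;
	}
}

/*
 * Non-fatal: devices and drivers stay in place (the real kernel would
 * walk the driver error callbacks here; the model does nothing).
 * Fatal: the recovery handler itself removes the devices, resets the
 * link and rescans, all while holding the modelled lock.
 * reset_ok says whether the modelled link reset brings the link back.
 */
enum recovery_result model_recover(struct model_bus *bus,
				   enum err_severity sev, bool reset_ok)
{
	enum recovery_result res = DISCONNECT;

	if (sev == ERR_NONFATAL)
		return RECOVERED;

	model_lock(bus);
	model_remove_all(bus);
	bus->link_up = reset_ok;
	if (bus->link_up) {
		model_rescan(bus);
		res = RECOVERED;
	}
	model_unlock(bus);
	return res;
}
```

In this model a fatal error with a successful reset leaves every
device present and bound again, while a failed reset leaves the bus
empty and returns DISCONNECT, which is the case where nothing can be
recovered because the link stays down.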
