Subject: Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed
From: Benjamin Herrenschmidt
Reply-To: benh@au1.ibm.com
To: poza@codeaurora.org, Thomas Tai
Cc: bhelgaas@google.com, keith.busch@intel.com, linux-pci@vger.kernel.org, linux-pci-owner@vger.kernel.org
Date: Thu, 16 Aug 2018 07:55:48 +1000
References: <1534179088-44219-1-git-send-email-thomas.tai@oracle.com> <1534179088-44219-2-git-send-email-thomas.tai@oracle.com> <51f4b387d9bd96a42d526a6a029fc43b@codeaurora.org>
Message-Id: <1c844f2fff88379a2ccc84a1c2ddc2a61dfa036f.camel@au1.ibm.com>

On Tue, 2018-08-14 at 14:52 +0530, poza@codeaurora.org wrote:
> > >  if (result == PCI_ERS_RESULT_RECOVERED) {
> > >          if (pcie_wait_for_link(udev, true))
> > >                  pci_rescan_bus(udev->bus);
> > > -        pci_info(dev, "Device recovery from fatal error successful\n");
> > > +        /* find the pci_dev after rescanning the bus */
> > > +        dev = pci_get_domain_bus_and_slot(domain, bus, devfn);
> >
> > one of the motivations was to remove and re-enumerate rather than
> > going through driver's recovery sequence
> > was; it might be possible that hotplug capable bridge, the device
> > might have changed.
> > hence this check will fail

Under what circumstances do you actually "unplug" the device?

We are trying to clean up and fix some of the PowerPC EEH code, which is
in a way similar to AER, and we found that this unplug/replug, which we
do only if the driver doesn't have recovery callbacks, is causing more
problems than it solves.

We are moving toward unbinding the driver, resetting the device, then
re-binding the driver instead of unplug/replug.

Also, why would you ever bypass the driver callbacks if the driver has
some? The whole point is to keep the driver bound while resetting the
device (provided it has the right callbacks) so we don't lose the linkage
between storage devices and mounted filesystems.

Cheers,
Ben.
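
For illustration only, the policy argued for above can be sketched as a
small standalone userspace model. This is not kernel code and not the EEH
or AER implementation: struct model_driver, struct model_device,
enum recovery_path and recover_device() are hypothetical names invented
for this sketch, and the "no driver bound" branch is an assumption that
the mail does not discuss.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT struct pci_driver / struct pci_dev. */
struct model_driver {
    bool has_error_callbacks;   /* driver supplies recovery callbacks */
};

struct model_device {
    struct model_driver *driver; /* NULL when no driver is bound */
    int resets;                  /* how many times the device was reset */
    int rebinds;                 /* how many unbind/rebind cycles happened */
};

enum recovery_path {
    RECOVER_KEEP_BOUND,     /* driver stays bound; its callbacks handle the reset */
    RECOVER_UNBIND_REBIND,  /* unbind driver, reset device, rebind driver */
    RECOVER_RESET_ONLY      /* assumption: nothing bound, so only reset */
};

/* Pick a recovery path without ever unplugging/re-enumerating the device. */
enum recovery_path recover_device(struct model_device *dev)
{
    struct model_driver *drv;

    assert(dev != NULL);
    drv = dev->driver;

    if (drv == NULL) {
        dev->resets++;
        return RECOVER_RESET_ONLY;
    }

    if (drv->has_error_callbacks) {
        /* Keep the binding so users of the device (for example a mounted
         * filesystem on a storage device) keep their linkage. */
        dev->resets++;
        return RECOVER_KEEP_BOUND;
    }

    /* No callbacks: unbind, reset, rebind. The device object itself
     * survives, unlike with unplug/replug. */
    dev->driver = NULL;   /* unbind */
    dev->resets++;        /* reset */
    dev->driver = drv;    /* rebind */
    dev->rebinds++;
    return RECOVER_UNBIND_REBIND;
}
```

In this model the device object is never destroyed on any path, which is
the property that avoids having to look the device up again after a
rescan.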