From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp.codeaurora.org ([198.145.29.96]:59002 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729249AbeHOSwV (ORCPT ); Wed, 15 Aug 2018 14:52:21 -0400 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Date: Wed, 15 Aug 2018 21:29:37 +0530 From: poza@codeaurora.org To: Thomas Tai Cc: bhelgaas@google.com, keith.busch@intel.com, linux-pci@vger.kernel.org, linux-pci-owner@vger.kernel.org Subject: Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed In-Reply-To: References: <1534179088-44219-1-git-send-email-thomas.tai@oracle.com> <1534179088-44219-2-git-send-email-thomas.tai@oracle.com> <51f4b387d9bd96a42d526a6a029fc43b@codeaurora.org> <8b8db3c5-e884-d697-6193-b409276f9638@oracle.com> <08462817-7a4f-2d8a-a329-e856c9e702c9@oracle.com> <9215fe3c62c2f15d7da221bb77bd5680@codeaurora.org> Message-ID: <3c5fe7896f5d558fef9d37f8b7156a3b@codeaurora.org> Sender: linux-pci-owner@vger.kernel.org List-ID: On 2018-08-15 21:13, Thomas Tai wrote: > [ ... ]>>>>>>> +        if (dev) >>>>>>>> +            pci_info(dev, "Device recovery from fatal error >>>>>>>> successful\n"); >>>>>>>> +        else >>>>>>>> +            pr_err("AER: Can not find pci_dev for >>>>>>>> %04x:%02x:%02x.%x\n", >>>>>>>> +                   domain, bus, >>>>>>>> +                   PCI_SLOT(devfn), PCI_FUNC(devfn)); >>>>>>>> +        pci_dev_put(dev); >>>>> >>>>> That is, replace above with: >>>>> >>>>>     pr_info("AER: Device %04x:%02x:%02x.%x recover from fatal error >>>>> successful\n", domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); >>>>> >>>> is this not sufficient to print ?  which is there already. >>>> pci_info(dev, "Device recovery from fatal error successful\n"); >>> >>> Unfortunately, I can't use "dev" because as you said I can't >>> pci_get_domain_slot to search for it after rescanning. >>> >>> -T >> >> If I understood the patch correctly, >> 1) it is restructuring of pci_dev_put(dev); >> 2) now we are sending uevent only if it is bridge. >> 3) the other logic where you store and compare bdf is not required, >> although we have to fix following. >> pci_info(dev, "Device recovery from fatal error successful\n"); >> >> >> probably 1) and 2) could be a separate patch. > > Yes Oza. I will separate 1) and 2) and fix the pci_info(dev, "...") to > use domain:bus:devfn Thanks Thomas, although I am not sure on 3) by doing following. dev = pci_get_domain_bus_and_slot(domain, bus, devfn); if (dev) pci_info(dev, "Device recovery from fatal error successful\n"); but also it may be the case that device is removed and there is no new device which would match BDF so in any case recovery has happened as far as PCIe link is concerned. surprisingly in my case followin also went fine pci_info(dev, "Device recovery from fatal error successful\n"); I guess probably after re-enumeration, I got the sane dev back !! (since it enumerated the same device) how about something like this dev = pci_get_domain_bus_and_slot(domain, bus, devfn); if (dev) pci_info(dev, "Device recovery from fatal error successful\n"); else pr_ifo ("AER: Device recovery from fatal error successful") Regards, Oza. > >> >> by the way, >> In my testing even after removing device the call pci_uevent_ers(dev, >> PCI_ERS_RESULT_DISCONNECT); did not create any problem. > > I think that is dangerous, if we restructure pci_dev_put(dev), ie > revert commit bd91b56cb3b27492963caeb5fccefe20a986ca8d. The dev > structure is freed and we shouldn't really use it to send > pci_uevent_ers(). Although pci_uevent_ers() may be smart enough to > figure out to avoid crash, we should only use it when the dev is not > removed. > I Agree. > Thanks, > Thomas > >>> >> >>> >>>> >>>>> >>>>>>>>      } else { >>>>>>>> -        pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT); >>>>>>>> -        pci_info(dev, "Device recovery from fatal error >>>>>>>> failed\n"); >>>>>>>> +        if (is_bridge_device) >>>>>>> previously there was no condition.. why is this required now ? >>>>>>> and it was doing on "dev" while you are doing it on "udev" >>>>>> is that the reason we dont watnt o use dev after it is removed ? >>>>>> instead do pci_uevent_ers on bridge ? >>>>> >>>>> >>>>> Yes, that's exactly the reason. I should have written down the >>>>> reason >>>>> in the commit log. Here is the details explanation for the benefit >>>>> of >>>>> others. If the failing device is a bridge device, udev is equal to >>>>> dev. So we can use udev in the pci_uevent_ers. But for non bridge >>>>> device the device is already removed from the bus and thus can't >>>>> send >>>>> pci_uevent_ers. >>>>> >>>>> Thank you, >>>>> Thomas >>>>> >>>>>>>> +            pci_uevent_ers(udev, PCI_ERS_RESULT_DISCONNECT); >>>>>>>> +        pr_err("AER: Device %04x:%02x:%02x.%x recovery from >>>>>>>> fatal error failed\n", >>>>>>>> +               domain, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); >>>>>>>>      } >>>>>>>> >>>>>>>> -    pci_dev_put(dev); >>>>>>>>      pci_unlock_rescan_remove(); >>>>>>>>  } >>>>>>> >>>>>>> Regards, >>>>>>> Oza.