From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from gate.crashing.org ([63.228.1.57]:50451 "EHLO gate.crashing.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726046AbeHPJ4V (ORCPT ); Thu, 16 Aug 2018 05:56:21 -0400 Message-ID: <5bd99bcacb772b588771fce62c61a59fdeb167f3.camel@kernel.crashing.org> Subject: Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed From: Benjamin Herrenschmidt To: poza@codeaurora.org Cc: Thomas Tai , bhelgaas@google.com, keith.busch@intel.com, linux-pci@vger.kernel.org, linux-pci-owner@vger.kernel.org Date: Thu, 16 Aug 2018 16:59:34 +1000 In-Reply-To: References: <1534179088-44219-1-git-send-email-thomas.tai@oracle.com> <1534179088-44219-2-git-send-email-thomas.tai@oracle.com> <51f4b387d9bd96a42d526a6a029fc43b@codeaurora.org> <903394c04d6ad468ed06dc0a779200e7555345a7.camel@kernel.crashing.org> <6cb069038530757f31f3dd60328c7e30@codeaurora.org> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Sender: linux-pci-owner@vger.kernel.org List-ID: On Thu, 2018-08-16 at 16:51 +1000, Benjamin Herrenschmidt wrote: > > refering Sepc: 6.2.2.2.1 Fatal Errors, where link is unreliable and it > > might need AER style reset of link or DPC style HW recovery > > In both cases, the shutdown callbacks are expected to be called, > > No, this is wrong and not the intent of the error handling. > > You seem to be applying PCIe specific concepts brain-farted at Intel > that are way way away from what we care about in practice and in Linux. > > > e.g. some driver handle errors ERR_NONFATAL or FATAL in similar ways > > e.g. > > ioat_pcie_error_detected(); calls ioat_shutdown(); in case of > > ERR_NONFATAL > > otherwise ioat_shutdown() in case of ERR_FATAL. > > Since when the error handling callbacks even have the concept of FATAL > vs. non-fatal ? This doesn't appear anyhwhere in the prototype of the > struct pci_error_handlers and shouldn't. In fact looking at pcie_do_nonfatal_recovery() it's indeed completely broken. It tells the driver that the slot was reset without actually resetting anything... Ugh. Ben.