Linux-PCI Archive on lore.kernel.org
From: Jay Vosburgh <jay.vosburgh@canonical.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: "Kuppuswamy\,
	Sathyanarayanan"  <sathyanarayanan.kuppuswamy@linux.intel.com>,
	linux-pci@vger.kernel.org
Subject: Re: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery
Date: Wed, 06 May 2020 17:56:12 -0700
Message-ID: <18609.1588812972@famine>
In-Reply-To: <20200506203249.GA453633@bjorn-Precision-5520>

Bjorn Helgaas <helgaas@kernel.org> wrote:

>On Wed, May 06, 2020 at 11:08:35AM -0700, Jay Vosburgh wrote:
>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>> 
>> >"Kuppuswamy, Sathyanarayanan" wrote:
>> >
>> >>Hi Jay,
>> >>
>> >>On 4/29/20 6:15 PM, Kuppuswamy, Sathyanarayanan wrote:
>> >>>
>> >>>
>> >>> On 4/29/20 5:42 PM, Jay Vosburgh wrote:
[...]
>> >>> I think this issue is related to the issue discussed in following
>> >>> thread (DPC non-hotplug support).
>> >>>
>> >>> https://lkml.org/lkml/2020/3/28/328
>> >>>
>> >>> If my assumption is correct, you are dealing with devices which are
>> >>> not hotplug capable. If the devices are hotplug capable then you don't
>> >>> need to proceed to report_slot_reset(), since hotplug handler will
>> >>> remove/re-enumerate the devices correctly.
>> >
>> >	Correct, this particular device (a network card) is in a
>> >non-hotplug slot.
>> >
>> >>Can you check whether following fix works for you?
>> >
>> >	Yes, it does.
>> >
>> >	I fixed up the whitespace and made a minor change to add braces
>> >in what look like the correct places around the "if (reset_link)" block;
>> >the patch I tested with is below.  I'll also install this on another
>> >machine with hotplug capable slots to test there as well.
>> 
>> 	We've tested the below patch on a couple of different machines
>> and devices (network card, NVMe device) and it appears to solve the
>> recovery issue in our testing.
>> 
>> 	Is there anything further we need to do, or can this be
>> considered for inclusion upstream at this time?
>
>Can somebody please post a clean version of what we should merge?
>There was the initial patch plus a follow-up fix, so it's not clear
>where we ended up.
>
>Bjorn

	Below is the patch we tested, from Sathyanarayanan's test patch
(slightly edited to clarify ambiguous "if else" nesting), along with an
edited version of the commit log from my original patch.  I have not
seen a Signed-off-by from Sathyanarayanan, so I didn't include one here.

	One question I have is that, after the patch is applied, the
"status" filled in by pci_walk_bus(... report_frozen_detected ...) is
discarded regardless of its value.  Is that correct behavior in all
cases?  The original issue I was trying to solve was that the status set
by report_frozen_detected was thrown away starting with 6d2c89441571
("PCI/ERR: Update error status after reset_link()"), causing a
_NEED_RESET to be lost.  With the below patch, all cases of
pci_channel_io_frozen will call reset_link unconditionally.
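
	To make the question concrete, here is a rough userspace model of
the status flow through the detect and reset stage as I read the patched
code.  It is only a sketch: the enum names, status_after_reset_stage(),
the link_ok()/link_fail() stubs and the bus_reset_fails flag are
hypothetical stand-ins, not kernel API, and the pci_walk_bus() result is
reduced to a single "driver_vote" argument.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

enum channel_state { CHANNEL_IO_NORMAL, CHANNEL_IO_FROZEN };

/* Models the reset_link callback handed to pcie_do_recovery(). */
typedef enum ers_result (*reset_link_fn)(void);

/* Hypothetical stubs: a link reset that succeeds and one that fails. */
static enum ers_result link_ok(void)   { return ERS_RECOVERED; }
static enum ers_result link_fail(void) { return ERS_DISCONNECT; }

/*
 * Status after the detect and reset stage in the patched flow.
 * driver_vote models what pci_walk_bus(... report_*_detected ...) left
 * in "status"; bus_reset_fails models a nonzero return from
 * pci_bus_error_reset().
 */
static enum ers_result status_after_reset_stage(enum channel_state state,
						enum ers_result driver_vote,
						reset_link_fn reset_link,
						bool bus_reset_fails)
{
	enum ers_result status = driver_vote;

	/* Frozen channel: the drivers' vote is overwritten unconditionally. */
	if (state == CHANNEL_IO_FROZEN)
		status = ERS_NEED_RESET;

	if (status == ERS_NEED_RESET) {
		if (reset_link) {
			if (reset_link() != ERS_RECOVERED)
				status = ERS_DISCONNECT;
		} else if (bus_reset_fails) {
			status = ERS_DISCONNECT;
		}
	}

	return status;
}
```

	In this model a frozen channel always proceeds with _NEED_RESET when
the reset succeeds, even if a driver voted _DISCONNECT from
report_frozen_detected; that is the case I am unsure about.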

	-J

Subject: [PATCH] PCI/ERR: Resolve regression in pcie_do_recovery

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
 
	Commit 6d2c89441571 ("PCI/ERR: Update error status after
reset_link()") introduced a regression: pcie_do_recovery discards the
status result from report_frozen_detected.  This can cause a
failure to recover if report_frozen_detected returns _NEED_RESET
and report_slot_reset is then not invoked.

	Such an event can be induced for testing purposes by reducing
the Max_Payload_Size of a PCIe bridge to less than that of a device
downstream from the bridge, and then initiating I/O through the device,
resulting in oversize transactions.  In the presence of DPC, this
results in a containment event and attempted reset and recovery via
pcie_do_recovery.  After 6d2c89441571, report_slot_reset is not invoked,
and the device does not recover.

Fixes: 6d2c89441571 ("PCI/ERR: Update error status after reset_link()")


diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..db80e1ecb2dc 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -165,13 +165,24 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bus(bus, report_frozen_detected, &status);
-		status = reset_link(dev);
-		if (status != PCI_ERS_RESULT_RECOVERED) {
+		status = PCI_ERS_RESULT_NEED_RESET;
+	} else {
+		pci_walk_bus(bus, report_normal_detected, &status);
+	}
+
+	if (status == PCI_ERS_RESULT_NEED_RESET) {
+		if (reset_link) {
+			if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED)
+				status = PCI_ERS_RESULT_DISCONNECT;
+		} else {
+			if (pci_bus_error_reset(dev))
+				status = PCI_ERS_RESULT_DISCONNECT;
+		}
+
+		if (status == PCI_ERS_RESULT_DISCONNECT) {
 			pci_warn(dev, "link reset failed\n");
 			goto failed;
 		}
-	} else {
-		pci_walk_bus(bus, report_normal_detected, &status);
 	}
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {


---
	-Jay Vosburgh, jay.vosburgh@canonical.com
