Re: HW power fault defect cause system hang on kernel 5.4.y

From: Lukas Wunner <lukas@wunner.de>
To: "Bao, Joseph" <joseph.bao@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Stuart Hayes <stuart.w.hayes@gmail.com>,
	kw@linux.com
Subject: Re: HW power fault defect cause system hang on kernel 5.4.y
Date: Mon, 15 Nov 2021 20:27:23 +0100
Message-ID: <20211115192723.GA19161@wunner.de>
In-Reply-To: <DM8PR11MB5702255A6A92F735D90A4446868B9@DM8PR11MB5702.namprd11.prod.outlook.com>

On Tue, Nov 02, 2021 at 03:45:00AM +0000, Bao, Joseph wrote:
> Recently we encounter system hang (dead spinlock) when move to kernel
> linux-5.4.y. 
> 
> Finally, we use bisect to locate the suspicious commit https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4667358dab9cc07da044d5bc087065545b1000df.
> 
> Our system has some HW defect, which will wrongly set PCI_EXP_SLTSTA_PFD
> high, and this commit will lead to infinite loop jumping to read_status
> (no chance to clear status PCI_EXP_SLTSTA_PFD bit since ctrl is not
> updated), I know this is our HW defect, but this commit makes kernel
> trapped in this isr function and leads to kernel hang (then the user
> could not get useful information to show what's wrong), which I think
> is not expected behavior, so I would like to report to you for discussion.

Thanks a lot for the report and apologies for the breakage and the delay.
Below please find a tentative fix.  Could you test whether it fixes the
issue?

I don't think this is a hardware defect.  If I'm reading the spec right
(PCIe r5.0, sec. 6.7.1.8), the PFD bit is meant to remain set and cannot
be cleared until the kernel disables slot power.
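
To make the failure mode concrete, here is a minimal user-space sketch
of the read_status loop in pciehp_isr().  It is not kernel code:
struct sim_slot, sim_write_sltsta(), sim_isr() and MAX_PASSES are
made-up names for illustration, and the register model (write-1-to-clear
bits, with PFD staying latched while slot power is on) is my reading of
the spec rather than a description of any particular hardware.

```c
#include <stdbool.h>

#define SLTSTA_PFD	0x0002	/* same bit as PCI_EXP_SLTSTA_PFD */
#define SLTSTA_PDC	0x0008	/* same bit as PCI_EXP_SLTSTA_PDC */
#define SLTSTA_EVENTS	(SLTSTA_PFD | SLTSTA_PDC)
#define MAX_PASSES	1000	/* hypothetical guard so the sketch terminates */

/* Hypothetical stand-in for the slot registers and pciehp's flag. */
struct sim_slot {
	unsigned int sltsta;		/* simulated Slot Status register */
	bool power_on;			/* slot power state */
	bool power_fault_detected;	/* mirrors ctrl->power_fault_detected */
};

/*
 * Write-1-to-clear semantics.  Assumption: while slot power is on, the
 * power fault latch keeps PFD asserted, so writing 1 does not clear it.
 */
static void sim_write_sltsta(struct sim_slot *s, unsigned int val)
{
	if (s->power_on)
		val &= ~SLTSTA_PFD;
	s->sltsta &= ~val;
}

/*
 * Models the read_status loop in MSI mode.  Returns the number of passes
 * through the loop, or -1 if the guard tripped (i.e. the real handler
 * would spin forever).
 */
static int sim_isr(struct sim_slot *s, bool with_fix)
{
	unsigned int events = 0, status;
	int passes = 0;

	for (;;) {
		if (++passes > MAX_PASSES)
			return -1;

		status = s->sltsta & SLTSTA_EVENTS;

		/* Ignore PFD once it has been recorded. */
		if (s->power_fault_detected)
			status &= ~SLTSTA_PFD;
		else if (with_fix && (status & SLTSTA_PFD))
			s->power_fault_detected = true;

		events |= status;
		if (!events)
			return passes;		/* IRQ_NONE */

		if (!status)
			return passes;		/* all event bits handled */

		/* Without the fix, the accumulated events are written back. */
		sim_write_sltsta(s, with_fix ? status : events);
		/* MSI mode: goto read_status */
	}
}
```

Starting from { .sltsta = SLTSTA_PFD, .power_on = true }, calling
sim_isr(&slot, false) hits the guard and returns -1, whereas
sim_isr(&slot, true) returns after two passes with
power_fault_detected set.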

When a power fault happens, we currently only change the LEDs (Power
Indicator Off, Attention Indicator On) and emit a log message.
We otherwise leave the slot as is, even though I'd assume that the
PCI device in the slot is no longer accessible.

I'm wondering whether we should interpret a power fault as surprise
removal.  Alternatively, we could attempt recovery, i.e. turn slot
power off and back on, similar to what we do when an Uncorrectable
Error occurs.  Do you have an opinion on that?  What would be the
desired behavior for your users?

Thanks,

Lukas

-- >8 --

Subject: [PATCH] PCI: pciehp: Fix infinite loop in IRQ handler upon power
 fault

The Power Fault Detected bit in the Slot Status register differs from
all other hotplug events in that it is sticky:  It can only be cleared
after turning off slot power.  Per PCIe r5.0, sec. 6.7.1.8:

  If a power controller detects a main power fault on the hot-plug slot,
  it must automatically set its internal main power fault latch [...].
  The main power fault latch is cleared when software turns off power to
  the hot-plug slot.

The stickiness used to cause interrupt storms and infinite loops which
were fixed in 2009 by commits 5651c48cfafe ("PCI pciehp: fix power fault
interrupt storm problem") and 99f0169c17f3 ("PCI: pciehp: enable
software notification on empty slots").

Unfortunately in 2020 the infinite loop issue was inadvertently
reintroduced by commit 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt
race"):  The hardirq handler pciehp_isr() keeps trying to clear the PFD
bit until pciehp's power_fault_detected flag is set.  That flag is only
set in the IRQ thread pciehp_ist(), which never learns of the event
because the hardirq handler is stuck in an infinite loop.  Fix by
setting the power_fault_detected flag already in the hardirq handler.

Fixes: 8edf5332c393 ("PCI: pciehp: Fix MSI interrupt race")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=214989
Link: https://lore.kernel.org/linux-pci/DM8PR11MB5702255A6A92F735D90A4446868B9@DM8PR11MB5702.namprd11.prod.outlook.com
Reported-by: Joseph Bao <joseph.bao@intel.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v4.19+
Cc: Stuart Hayes <stuart.w.hayes@gmail.com>
---
 drivers/pci/hotplug/pciehp_hpc.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 6ac5ea5..fac6b8e 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -640,6 +640,8 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
 	 */
 	if (ctrl->power_fault_detected)
 		status &= ~PCI_EXP_SLTSTA_PFD;
+	else if (status & PCI_EXP_SLTSTA_PFD)
+		ctrl->power_fault_detected = true;
 
 	events |= status;
 	if (!events) {
@@ -649,7 +651,7 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
 	}
 
 	if (status) {
-		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, events);
+		pcie_capability_write_word(pdev, PCI_EXP_SLTSTA, status);
 
 		/*
 		 * In MSI mode, all event bits must be zero before the port
@@ -723,8 +725,7 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	}
 
 	/* Check Power Fault Detected */
-	if ((events & PCI_EXP_SLTSTA_PFD) && !ctrl->power_fault_detected) {
-		ctrl->power_fault_detected = 1;
+	if (events & PCI_EXP_SLTSTA_PFD) {
 		ctrl_err(ctrl, "Slot(%s): Power fault\n", slot_name(ctrl));
 		pciehp_set_indicators(ctrl, PCI_EXP_SLTCTL_PWR_IND_OFF,
 				      PCI_EXP_SLTCTL_ATTN_IND_ON);
-- 
2.33.0