[PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
@ 2022-03-11  2:58 ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 12+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-03-11  2:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Russell Currey, Oliver OHalloran
  Cc: linux-pci, linuxppc-dev, linux-kernel,
	Kuppuswamy Sathyanarayanan, Kuppuswamy Sathyanarayanan,
	Eric Badger, Ashok Raj

Currently the aer_irq() handler returns IRQ_NONE for cases without bits
PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
assumption is incorrect.

Consider a scenario where aer_irq() is triggered for a correctable
error, and while we process the error and before we clear the error
status in "Root Error Status" register, if the same kind of error
is triggered again, since aer_irq() only clears events it saw, the
multi-bit error is left in tact. This will cause the interrupt to fire
again, resulting in entering aer_irq() with just the multi-bit error
logged in the "Root Error Status" register.

Repeated AER recovery test has revealed this condition does happen
and this prevents any new interrupt from being triggered. Allow to
process interrupt even if only multi-correctable (BIT 1) or
multi-uncorrectable bit (BIT 3) is set.

Reported-by: Eric Badger <ebadger@purestorage.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/aer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..7952e5efd6cf 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -101,6 +101,11 @@ struct aer_stats {
 #define ERR_COR_ID(d)			(d & 0xffff)
 #define ERR_UNCOR_ID(d)			(d >> 16)

+#define AER_ERR_STATUS_MASK		(PCI_ERR_ROOT_UNCOR_RCV |	\
+					PCI_ERR_ROOT_COR_RCV |		\
+					PCI_ERR_ROOT_MULTI_COR_RCV |	\
+					PCI_ERR_ROOT_MULTI_UNCOR_RCV)
+
 static int pcie_aer_disable;
 static pci_ers_result_t aer_root_reset(struct pci_dev *dev);

@@ -1196,7 +1201,7 @@ static irqreturn_t aer_irq(int irq, void *context)
 	struct aer_err_source e_src = {};

 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
-	if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
+	if (!(e_src.status & AER_ERR_STATUS_MASK))
 		return IRQ_NONE;

 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
@ 2022-03-11  2:58 ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 12+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2022-03-11  2:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Russell Currey, Oliver OHalloran
  Cc: Kuppuswamy Sathyanarayanan, Kuppuswamy Sathyanarayanan,
	Ashok Raj, linux-pci, linux-kernel, Eric Badger, linuxppc-dev

Currently the aer_irq() handler returns IRQ_NONE for cases without bits
PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
assumption is incorrect.

Consider a scenario where aer_irq() is triggered for a correctable
error, and while we process the error and before we clear the error
status in "Root Error Status" register, if the same kind of error
is triggered again, since aer_irq() only clears events it saw, the
multi-bit error is left in tact. This will cause the interrupt to fire
again, resulting in entering aer_irq() with just the multi-bit error
logged in the "Root Error Status" register.

Repeated AER recovery test has revealed this condition does happen
and this prevents any new interrupt from being triggered. Allow to
process interrupt even if only multi-correctable (BIT 1) or
multi-uncorrectable bit (BIT 3) is set.

Reported-by: Eric Badger <ebadger@purestorage.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/aer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..7952e5efd6cf 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -101,6 +101,11 @@ struct aer_stats {
 #define ERR_COR_ID(d)			(d & 0xffff)
 #define ERR_UNCOR_ID(d)			(d >> 16)

+#define AER_ERR_STATUS_MASK		(PCI_ERR_ROOT_UNCOR_RCV |	\
+					PCI_ERR_ROOT_COR_RCV |		\
+					PCI_ERR_ROOT_MULTI_COR_RCV |	\
+					PCI_ERR_ROOT_MULTI_UNCOR_RCV)
+
 static int pcie_aer_disable;
 static pci_ers_result_t aer_root_reset(struct pci_dev *dev);

@@ -1196,7 +1201,7 @@ static irqreturn_t aer_irq(int irq, void *context)
 	struct aer_err_source e_src = {};

 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
-	if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
+	if (!(e_src.status & AER_ERR_STATUS_MASK))
 		return IRQ_NONE;

 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
-- 
2.25.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
  2022-03-11  2:58 ` Kuppuswamy Sathyanarayanan
@ 2022-03-13 19:52   ` Bjorn Helgaas
  -1 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2022-03-13 19:52 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Bjorn Helgaas, Russell Currey, Oliver OHalloran,
	Kuppuswamy Sathyanarayanan, Ashok Raj, linux-pci, linux-kernel,
	Eric Badger, linuxppc-dev

On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> assumption is incorrect.
> 
> Consider a scenario where aer_irq() is triggered for a correctable
> error, and while we process the error and before we clear the error
> status in "Root Error Status" register, if the same kind of error
> is triggered again, since aer_irq() only clears events it saw, the
> multi-bit error is left in tact. This will cause the interrupt to fire
> again, resulting in entering aer_irq() with just the multi-bit error
> logged in the "Root Error Status" register.
> 
> Repeated AER recovery test has revealed this condition does happen
> and this prevents any new interrupt from being triggered. Allow to
> process interrupt even if only multi-correctable (BIT 1) or
> multi-uncorrectable bit (BIT 3) is set.
> 
> Reported-by: Eric Badger <ebadger@purestorage.com>

Is there a bug report with any concrete details (dmesg, lspci, etc)
that we can include here?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
@ 2022-03-13 19:52   ` Bjorn Helgaas
  0 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2022-03-13 19:52 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Kuppuswamy Sathyanarayanan, Ashok Raj, linux-pci, linux-kernel,
	Eric Badger, Oliver OHalloran, Bjorn Helgaas, linuxppc-dev

On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> assumption is incorrect.
> 
> Consider a scenario where aer_irq() is triggered for a correctable
> error, and while we process the error and before we clear the error
> status in "Root Error Status" register, if the same kind of error
> is triggered again, since aer_irq() only clears events it saw, the
> multi-bit error is left in tact. This will cause the interrupt to fire
> again, resulting in entering aer_irq() with just the multi-bit error
> logged in the "Root Error Status" register.
> 
> Repeated AER recovery test has revealed this condition does happen
> and this prevents any new interrupt from being triggered. Allow to
> process interrupt even if only multi-correctable (BIT 1) or
> multi-uncorrectable bit (BIT 3) is set.
> 
> Reported-by: Eric Badger <ebadger@purestorage.com>

Is there a bug report with any concrete details (dmesg, lspci, etc)
that we can include here?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
  2022-03-13 19:52   ` Bjorn Helgaas
@ 2022-03-13 21:43     ` Raj, Ashok
  -1 siblings, 0 replies; 12+ messages in thread
From: Raj, Ashok @ 2022-03-13 21:43 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Bjorn Helgaas, Russell Currey,
	Oliver OHalloran, Kuppuswamy Sathyanarayanan, linux-pci,
	linux-kernel, Eric Badger, linuxppc-dev, Ashok Raj

On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > assumption is incorrect.
> > 
> > Consider a scenario where aer_irq() is triggered for a correctable
> > error, and while we process the error and before we clear the error
> > status in "Root Error Status" register, if the same kind of error
> > is triggered again, since aer_irq() only clears events it saw, the
> > multi-bit error is left in tact. This will cause the interrupt to fire
> > again, resulting in entering aer_irq() with just the multi-bit error
> > logged in the "Root Error Status" register.
> > 
> > Repeated AER recovery test has revealed this condition does happen
> > and this prevents any new interrupt from being triggered. Allow to
> > process interrupt even if only multi-correctable (BIT 1) or
> > multi-uncorrectable bit (BIT 3) is set.
> > 
> > Reported-by: Eric Badger <ebadger@purestorage.com>
> 
> Is there a bug report with any concrete details (dmesg, lspci, etc)
> that we can include here?

Eric might have more details to add when he collected numerous logs to get
to the timeline of the problem. The test was to stress the links with an
automated power off, this will result in some eDPC UC error followed by
link down. The recovery worked fine for several cycles and suddenly there
were no more interrupts. A manual rescan on pci would probe and device is
operational again.

The test patch revealed we entered the aer_irq() with just the multi-error
PCI_ERR_ROOT_MULTI_COR_RCV or PCI_ERR_ROOT_MULTI_UNCOR_RCV, then we didn't
clear those bits causing interrupt generation to cease after that.

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
@ 2022-03-13 21:43     ` Raj, Ashok
  0 siblings, 0 replies; 12+ messages in thread
From: Raj, Ashok @ 2022-03-13 21:43 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kuppuswamy Sathyanarayanan, Kuppuswamy Sathyanarayanan,
	Ashok Raj, linux-pci, linux-kernel, Eric Badger,
	Oliver OHalloran, Bjorn Helgaas, linuxppc-dev

On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > assumption is incorrect.
> > 
> > Consider a scenario where aer_irq() is triggered for a correctable
> > error, and while we process the error and before we clear the error
> > status in "Root Error Status" register, if the same kind of error
> > is triggered again, since aer_irq() only clears events it saw, the
> > multi-bit error is left in tact. This will cause the interrupt to fire
> > again, resulting in entering aer_irq() with just the multi-bit error
> > logged in the "Root Error Status" register.
> > 
> > Repeated AER recovery test has revealed this condition does happen
> > and this prevents any new interrupt from being triggered. Allow to
> > process interrupt even if only multi-correctable (BIT 1) or
> > multi-uncorrectable bit (BIT 3) is set.
> > 
> > Reported-by: Eric Badger <ebadger@purestorage.com>
> 
> Is there a bug report with any concrete details (dmesg, lspci, etc)
> that we can include here?

Eric might have more details to add when he collected numerous logs to get
to the timeline of the problem. The test was to stress the links with an
automated power off, this will result in some eDPC UC error followed by
link down. The recovery worked fine for several cycles and suddenly there
were no more interrupts. A manual rescan on pci would probe and device is
operational again.

The test patch revealed we entered the aer_irq() with just the multi-error
PCI_ERR_ROOT_MULTI_COR_RCV or PCI_ERR_ROOT_MULTI_UNCOR_RCV, then we didn't
clear those bits causing interrupt generation to cease after that.

Cheers,
Ashok

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
  2022-03-13 21:43     ` Raj, Ashok
@ 2022-03-14 16:21       ` Eric Badger
  -1 siblings, 0 replies; 12+ messages in thread
From: Eric Badger @ 2022-03-14 16:21 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Bjorn Helgaas, Kuppuswamy Sathyanarayanan, Bjorn Helgaas,
	Russell Currey, Oliver OHalloran, Kuppuswamy Sathyanarayanan,
	linux-pci, linux-kernel, linuxppc-dev

On Sun, Mar 13, 2022 at 02:43:14PM -0700, Raj, Ashok wrote:
> On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> > On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > > assumption is incorrect.
> > > 
> > > Consider a scenario where aer_irq() is triggered for a correctable
> > > error, and while we process the error and before we clear the error
> > > status in "Root Error Status" register, if the same kind of error
> > > is triggered again, since aer_irq() only clears events it saw, the
> > > multi-bit error is left in tact. This will cause the interrupt to fire
> > > again, resulting in entering aer_irq() with just the multi-bit error
> > > logged in the "Root Error Status" register.
> > > 
> > > Repeated AER recovery test has revealed this condition does happen
> > > and this prevents any new interrupt from being triggered. Allow to
> > > process interrupt even if only multi-correctable (BIT 1) or
> > > multi-uncorrectable bit (BIT 3) is set.
> > > 
> > > Reported-by: Eric Badger <ebadger@purestorage.com>
> > 
> > Is there a bug report with any concrete details (dmesg, lspci, etc)
> > that we can include here?
> 
> Eric might have more details to add when he collected numerous logs to get
> to the timeline of the problem. The test was to stress the links with an
> automated power off, this will result in some eDPC UC error followed by
> link down. The recovery worked fine for several cycles and suddenly there
> were no more interrupts. A manual rescan on pci would probe and device is
> operational again.

The problem was originally discovered while performing a looping hot plug
test. At hot remove time, one or more corrected errors usually appeared:

[256236.078151] pcieport 0000:89:02.0: AER: Corrected error received: 0000:89:02.0
[256236.078154] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[256236.088606] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000001/00000000
[256236.097857] pcieport 0000:89:02.0: AER:    [ 0] RxErr                 
[256236.152622] pcieport 0000:89:02.0: pciehp: Slot(400): Link Down
[256236.152623] pcieport 0000:89:02.0: pciehp: Slot(400): Card not present
[256236.152631] pcieport 0000:89:02.0: DPC: containment event, status:0x1f01 source:0x0000
[256236.152632] pcieport 0000:89:02.0: DPC: unmasked uncorrectable error detected reason 0 ext_reason 0
[256236.152634] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[256236.164207] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000020/00100000
[256236.173464] pcieport 0000:89:02.0: AER:    [ 5] SDES                   (First)
[256236.278407] pci 0000:8a:00.0: Removing from iommu group 32
[256237.500837] pcieport 0000:89:02.0: Data Link Layer Link Active not set in 1000 msec
[256237.500842] pcieport 0000:89:02.0: link reset at upstream device 0000:89:02.0 failed
[256237.500865] pcieport 0000:89:02.0: AER: Device recovery failed

The problematic case arose when 2 corrected errors arrived in a sequence like this:

1. Correctable error triggered, bit 0 (ERR_COR) set in Root Error Status,
   which now has value 0x1.
2. aer_irq() triggered, reads Root Error Status, finds value 0x1.
3. Second correctable error triggered, bit 1 (multiple ERR_COR) set in Root
   Error Status, which now has value 0x3.
4. aer_irq() writes back 0x1 to Root Error Status, which now has value 0x2.
5. aer_irq() triggered again due to the second error, but, finding value 0x2
   in Root Error Status, takes no action. Future interrupts are now inhibited.
  
My observation on Intel Icelake is that a new AER interrupt will be generated
when one writes to Root Error Status but other bits remain set. I concluded
this based on testing with ACPI EINJ and a hack like this:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..5c9bbbe7887b 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1196,6 +1196,10 @@ static irqreturn_t aer_irq(int irq, void *context)
 	struct aer_err_source e_src = {};
 
 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
+
+	pci_dbg(pdev->port, "Root Error Status: %04x\n", e_src.status);
+	return IRQ_NONE;
+
 	if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
 		return IRQ_NONE;
 
And then running these commands:

    # Prep injection data for a correctable error.
    $ cd /sys/kernel/debug/apei/einj
    $ echo 0x00000040 > error_type
    $ echo 0x4 > flags
    $ echo 0x891000 > param4
    
    # Root Error Status is initially clear
    $ setpci -s 89:02.0 ECAP0001+0x30.w
    0000
    
    # Inject one error
    $ echo 1 > error_inject
    
    # Interrupt received
    [  285.526275] pcieport 0000:89:02.0: AER: Root Error Status 0001
    
    # Inject another error
    $ echo 1 > error_inject
    
    # No interrupt received, but "multiple ERR_COR" is now set
    $ setpci -s 89:02.0 ECAP0001+0x30.w
    0003
    
    # Wait for a while, then clear ERR_COR. A new interrupt immediately fires.
    $ setpci -s 89:02.0 ECAP0001+0x30.w=0x1
    [  354.596748] pcieport 0000:89:02.0: AER: Root Error Status 0002


I've tried to track down some different hardware to confirm this behavior, but
haven't found any that can run this test.

My reading of the PCIe 5.0 spec, section "6.2.4.1.2 Interrupt Generation"
doesn't seem to describe the behavior I saw on Icelake.

	If a Root Port or Root Complex Event Collector is enabled for
	edge-triggered interrupt signaling using MSI or MSI-X, an interrupt
	message must be sent every time the logical AND of the following
	conditions transitions from FALSE to TRUE:
	...
    At least one Error Reporting Enable bit in the Root Error Command
    register and its associated error Messages Received bit in the Root
    Error Status register are both set to 1b.

This section of the spec seems to say that, if Root Error Status sees the
sequence of values described above (0x1->0x3->0x2), only one interrupt would
be generated, since there was no FALSE to TRUE transition at 0x3->0x2.  So you
would need something analogous to:

8edf5332c3934 ("PCI: pciehp: Fix MSI interrupt race")

However, this seems not to be the case for Icelake.

Cheers,
Eric

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
@ 2022-03-14 16:21       ` Eric Badger
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Badger @ 2022-03-14 16:21 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Kuppuswamy Sathyanarayanan, Kuppuswamy Sathyanarayanan,
	linux-pci, linux-kernel, Oliver OHalloran, Bjorn Helgaas,
	Bjorn Helgaas, linuxppc-dev

On Sun, Mar 13, 2022 at 02:43:14PM -0700, Raj, Ashok wrote:
> On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> > On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > > assumption is incorrect.
> > > 
> > > Consider a scenario where aer_irq() is triggered for a correctable
> > > error, and while we process the error and before we clear the error
> > > status in "Root Error Status" register, if the same kind of error
> > > is triggered again, since aer_irq() only clears events it saw, the
> > > multi-bit error is left in tact. This will cause the interrupt to fire
> > > again, resulting in entering aer_irq() with just the multi-bit error
> > > logged in the "Root Error Status" register.
> > > 
> > > Repeated AER recovery test has revealed this condition does happen
> > > and this prevents any new interrupt from being triggered. Allow to
> > > process interrupt even if only multi-correctable (BIT 1) or
> > > multi-uncorrectable bit (BIT 3) is set.
> > > 
> > > Reported-by: Eric Badger <ebadger@purestorage.com>
> > 
> > Is there a bug report with any concrete details (dmesg, lspci, etc)
> > that we can include here?
> 
> Eric might have more details to add when he collected numerous logs to get
> to the timeline of the problem. The test was to stress the links with an
> automated power off, this will result in some eDPC UC error followed by
> link down. The recovery worked fine for several cycles and suddenly there
> were no more interrupts. A manual rescan on pci would probe and device is
> operational again.

The problem was originally discovered while performing a looping hot plug
test. At hot remove time, one or more corrected errors usually appeared:

[256236.078151] pcieport 0000:89:02.0: AER: Corrected error received: 0000:89:02.0
[256236.078154] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[256236.088606] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000001/00000000
[256236.097857] pcieport 0000:89:02.0: AER:    [ 0] RxErr                 
[256236.152622] pcieport 0000:89:02.0: pciehp: Slot(400): Link Down
[256236.152623] pcieport 0000:89:02.0: pciehp: Slot(400): Card not present
[256236.152631] pcieport 0000:89:02.0: DPC: containment event, status:0x1f01 source:0x0000
[256236.152632] pcieport 0000:89:02.0: DPC: unmasked uncorrectable error detected reason 0 ext_reason 0
[256236.152634] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[256236.164207] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000020/00100000
[256236.173464] pcieport 0000:89:02.0: AER:    [ 5] SDES                   (First)
[256236.278407] pci 0000:8a:00.0: Removing from iommu group 32
[256237.500837] pcieport 0000:89:02.0: Data Link Layer Link Active not set in 1000 msec
[256237.500842] pcieport 0000:89:02.0: link reset at upstream device 0000:89:02.0 failed
[256237.500865] pcieport 0000:89:02.0: AER: Device recovery failed

The problematic case arose when 2 corrected errors arrived in a sequence like this:

1. Correctable error triggered, bit 0 (ERR_COR) set in Root Error Status,
   which now has value 0x1.
2. aer_irq() triggered, reads Root Error Status, finds value 0x1.
3. Second correctable error triggered, bit 1 (multiple ERR_COR) set in Root
   Error Status, which now has value 0x3.
4. aer_irq() writes back 0x1 to Root Error Status, which now has value 0x2.
5. aer_irq() triggered again due to the second error, but, finding value 0x2
   in Root Error Status, takes no action. Future interrupts are now inhibited.
  
My observation on Intel Icelake is that a new AER interrupt will be generated
when one writes to Root Error Status but other bits remain set. I concluded
this based on testing with ACPI EINJ and a hack like this:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..5c9bbbe7887b 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1196,6 +1196,10 @@ static irqreturn_t aer_irq(int irq, void *context)
 	struct aer_err_source e_src = {};
 
 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
+
+	pci_dbg(pdev->port, "Root Error Status: %04x\n", e_src.status);
+	return IRQ_NONE;
+
 	if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
 		return IRQ_NONE;
 
And then running these commands:

    # Prep injection data for a correctable error.
    $ cd /sys/kernel/debug/apei/einj
    $ echo 0x00000040 > error_type
    $ echo 0x4 > flags
    $ echo 0x891000 > param4
    
    # Root Error Status is initially clear
    $ setpci -s 89:02.0 ECAP0001+0x30.w
    0000
    
    # Inject one error
    $ echo 1 > error_inject
    
    # Interrupt received
    [  285.526275] pcieport 0000:89:02.0: AER: Root Error Status 0001
    
    # Inject another error
    $ echo 1 > error_inject
    
    # No interrupt received, but "multiple ERR_COR" is now set
    $ setpci -s 89:02.0 ECAP0001+0x30.w
    0003
    
    # Wait for a while, then clear ERR_COR. A new interrupt immediately fires.
    $ setpci -s 89:02.0 ECAP0001+0x30.w=0x1
    [  354.596748] pcieport 0000:89:02.0: AER: Root Error Status 0002


I've tried to track down some different hardware to confirm this behavior, but
haven't found any that can run this test.

My reading of the PCIe 5.0 spec, section "6.2.4.1.2 Interrupt Generation"
doesn't seem to describe the behavior I saw on Icelake.

	If a Root Port or Root Complex Event Collector is enabled for
	edge-triggered interrupt signaling using MSI or MSI-X, an interrupt
	message must be sent every time the logical AND of the following
	conditions transitions from FALSE to TRUE:
	...
    At least one Error Reporting Enable bit in the Root Error Command
    register and its associated error Messages Received bit in the Root
    Error Status register are both set to 1b.

This section of the spec seems to say that, if Root Error Status sees the
sequence of values described above (0x1->0x3->0x2), only one interrupt would
be generated, since there was no FALSE to TRUE transition at 0x3->0x2.  So you
would need something analogous to:

8edf5332c3934 ("PCI: pciehp: Fix MSI interrupt race")

However, this seems not to be the case for Icelake.

Cheers,
Eric

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
  2022-03-14 16:21       ` Eric Badger
@ 2022-03-17 22:59         ` Bjorn Helgaas
  -1 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2022-03-17 22:59 UTC (permalink / raw)
  To: Eric Badger
  Cc: Raj, Ashok, Kuppuswamy Sathyanarayanan, Bjorn Helgaas,
	Russell Currey, Oliver OHalloran, Kuppuswamy Sathyanarayanan,
	linux-pci, linux-kernel, linuxppc-dev

On Mon, Mar 14, 2022 at 09:21:46AM -0700, Eric Badger wrote:
> On Sun, Mar 13, 2022 at 02:43:14PM -0700, Raj, Ashok wrote:
> > On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > > > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > > > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > > > assumption is incorrect.
> > > > 
> > > > Consider a scenario where aer_irq() is triggered for a correctable
> > > > error, and while we process the error and before we clear the error
> > > > status in "Root Error Status" register, if the same kind of error
> > > > is triggered again, since aer_irq() only clears events it saw, the
> > > > multi-bit error is left in tact. This will cause the interrupt to fire
> > > > again, resulting in entering aer_irq() with just the multi-bit error
> > > > logged in the "Root Error Status" register.
> > > > 
> > > > Repeated AER recovery test has revealed this condition does happen
> > > > and this prevents any new interrupt from being triggered. Allow to
> > > > process interrupt even if only multi-correctable (BIT 1) or
> > > > multi-uncorrectable bit (BIT 3) is set.
> > > > 
> > > > Reported-by: Eric Badger <ebadger@purestorage.com>
> > > 
> > > Is there a bug report with any concrete details (dmesg, lspci, etc)
> > > that we can include here?
> > 
> > Eric might have more details to add when he collected numerous logs to get
> > to the timeline of the problem. The test was to stress the links with an
> > automated power off, this will result in some eDPC UC error followed by
> > link down. The recovery worked fine for several cycles and suddenly there
> > were no more interrupts. A manual rescan on pci would probe and device is
> > operational again.
> 
> The problem was originally discovered while performing a looping hot plug
> test. At hot remove time, one or more corrected errors usually appeared:
> 
> [256236.078151] pcieport 0000:89:02.0: AER: Corrected error received: 0000:89:02.0
> [256236.078154] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [256236.088606] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000001/00000000
> [256236.097857] pcieport 0000:89:02.0: AER:    [ 0] RxErr                 
> [256236.152622] pcieport 0000:89:02.0: pciehp: Slot(400): Link Down
> [256236.152623] pcieport 0000:89:02.0: pciehp: Slot(400): Card not present
> [256236.152631] pcieport 0000:89:02.0: DPC: containment event, status:0x1f01 source:0x0000
> [256236.152632] pcieport 0000:89:02.0: DPC: unmasked uncorrectable error detected reason 0 ext_reason 0
> [256236.152634] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [256236.164207] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000020/00100000
> [256236.173464] pcieport 0000:89:02.0: AER:    [ 5] SDES                   (First)
> [256236.278407] pci 0000:8a:00.0: Removing from iommu group 32
> [256237.500837] pcieport 0000:89:02.0: Data Link Layer Link Active not set in 1000 msec
> [256237.500842] pcieport 0000:89:02.0: link reset at upstream device 0000:89:02.0 failed
> [256237.500865] pcieport 0000:89:02.0: AER: Device recovery failed
> 
> The problematic case arose when 2 corrected errors arrived in a sequence like this:
> 
> 1. Correctable error triggered, bit 0 (ERR_COR) set in Root Error Status,
>    which now has value 0x1.
> 2. aer_irq() triggered, reads Root Error Status, finds value 0x1.
> 3. Second correctable error triggered, bit 1 (multiple ERR_COR) set in Root
>    Error Status, which now has value 0x3.
> 4. aer_irq() writes back 0x1 to Root Error Status, which now has value 0x2.
> 5. aer_irq() triggered again due to the second error, but, finding value 0x2
>    in Root Error Status, takes no action. Future interrupts are now inhibited.

Thanks for the additional details!

After this patch, I guess aer_irq() still reads 0x2
(PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
PCI_ERR_ROOT_MULTI_COR_RCV.

In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
which probably contains either 0 or junk left over from being captured
when PCI_ERR_ROOT_COR_RCV was set.

And aer_irq() will queue an e_src record with status ==
PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
in status, aer_isr_one_error() will do nothing, right?

That might not be *terrible* and is definitely better than not being
able to handle future interrupts.  But we basically threw away the
information that multiple errors occurred, and we queued an e_src
record that occupies space without being used for anything.

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
@ 2022-03-17 22:59         ` Bjorn Helgaas
  0 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2022-03-17 22:59 UTC (permalink / raw)
  To: Eric Badger
  Cc: Kuppuswamy Sathyanarayanan, Kuppuswamy Sathyanarayanan, Raj,
	Ashok, linux-pci, linux-kernel, Oliver OHalloran, Bjorn Helgaas,
	linuxppc-dev

On Mon, Mar 14, 2022 at 09:21:46AM -0700, Eric Badger wrote:
> On Sun, Mar 13, 2022 at 02:43:14PM -0700, Raj, Ashok wrote:
> > On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > > > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > > > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > > > assumption is incorrect.
> > > > 
> > > > Consider a scenario where aer_irq() is triggered for a correctable
> > > > error, and while we process the error and before we clear the error
> > > > status in "Root Error Status" register, if the same kind of error
> > > > is triggered again, since aer_irq() only clears events it saw, the
> > > > multi-bit error is left in tact. This will cause the interrupt to fire
> > > > again, resulting in entering aer_irq() with just the multi-bit error
> > > > logged in the "Root Error Status" register.
> > > > 
> > > > Repeated AER recovery test has revealed this condition does happen
> > > > and this prevents any new interrupt from being triggered. Allow to
> > > > process interrupt even if only multi-correctable (BIT 1) or
> > > > multi-uncorrectable bit (BIT 3) is set.
> > > > 
> > > > Reported-by: Eric Badger <ebadger@purestorage.com>
> > > 
> > > Is there a bug report with any concrete details (dmesg, lspci, etc)
> > > that we can include here?
> > 
> > Eric might have more details to add when he collected numerous logs to get
> > to the timeline of the problem. The test was to stress the links with an
> > automated power off, this will result in some eDPC UC error followed by
> > link down. The recovery worked fine for several cycles and suddenly there
> > were no more interrupts. A manual rescan on pci would probe and device is
> > operational again.
> 
> The problem was originally discovered while performing a looping hot plug
> test. At hot remove time, one or more corrected errors usually appeared:
> 
> [256236.078151] pcieport 0000:89:02.0: AER: Corrected error received: 0000:89:02.0
> [256236.078154] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [256236.088606] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000001/00000000
> [256236.097857] pcieport 0000:89:02.0: AER:    [ 0] RxErr                 
> [256236.152622] pcieport 0000:89:02.0: pciehp: Slot(400): Link Down
> [256236.152623] pcieport 0000:89:02.0: pciehp: Slot(400): Card not present
> [256236.152631] pcieport 0000:89:02.0: DPC: containment event, status:0x1f01 source:0x0000
> [256236.152632] pcieport 0000:89:02.0: DPC: unmasked uncorrectable error detected reason 0 ext_reason 0
> [256236.152634] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [256236.164207] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000020/00100000
> [256236.173464] pcieport 0000:89:02.0: AER:    [ 5] SDES                   (First)
> [256236.278407] pci 0000:8a:00.0: Removing from iommu group 32
> [256237.500837] pcieport 0000:89:02.0: Data Link Layer Link Active not set in 1000 msec
> [256237.500842] pcieport 0000:89:02.0: link reset at upstream device 0000:89:02.0 failed
> [256237.500865] pcieport 0000:89:02.0: AER: Device recovery failed
> 
> The problematic case arose when 2 corrected errors arrived in a sequence like this:
> 
> 1. Correctable error triggered, bit 0 (ERR_COR) set in Root Error Status,
>    which now has value 0x1.
> 2. aer_irq() triggered, reads Root Error Status, finds value 0x1.
> 3. Second correctable error triggered, bit 1 (multiple ERR_COR) set in Root
>    Error Status, which now has value 0x3.
> 4. aer_irq() writes back 0x1 to Root Error Status, which now has value 0x2.
> 5. aer_irq() triggered again due to the second error, but, finding value 0x2
>    in Root Error Status, takes no action. Future interrupts are now inhibited.

Thanks for the additional details!

After this patch, I guess aer_irq() still reads 0x2
(PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
PCI_ERR_ROOT_MULTI_COR_RCV.

In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
which probably contains either 0 or junk left over from being captured
when PCI_ERR_ROOT_COR_RCV was set.

And aer_irq() will queue an e_src record with status ==
PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
in status, aer_isr_one_error() will do nothing, right?

That might not be *terrible* and is definitely better than not being
able to handle future interrupts.  But we basically threw away the
information that multiple errors occurred, and we queued an e_src
record that occupies space without being used for anything.

Bjorn

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
  2022-03-17 22:59         ` Bjorn Helgaas
@ 2022-03-18 20:28           ` Sathyanarayanan Kuppuswamy
  -1 siblings, 0 replies; 12+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-03-18 20:28 UTC (permalink / raw)
  To: Bjorn Helgaas, Eric Badger
  Cc: Raj, Ashok, Bjorn Helgaas, Russell Currey, Oliver OHalloran,
	Kuppuswamy Sathyanarayanan, linux-pci, linux-kernel,
	linuxppc-dev



On 3/17/22 3:59 PM, Bjorn Helgaas wrote:
> Thanks for the additional details!
> 
> After this patch, I guess aer_irq() still reads 0x2
> (PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
> PCI_ERR_ROOT_MULTI_COR_RCV.
> 
> In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
> which probably contains either 0 or junk left over from being captured
> when PCI_ERR_ROOT_COR_RCV was set.
> 
> And aer_irq() will queue an e_src record with status ==
> PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
> in status, aer_isr_one_error() will do nothing, right?
> 
> That might not be*terrible*  and is definitely better than not being
> able to handle future interrupts.  But we basically threw away the
> information that multiple errors occurred, and we queued an e_src
> record that occupies space without being used for anything.

Yes, you are correct.  One other way to minimize this race window is to
clear the Root status register as soon as possible. Maybe we can move
it before source ID read as below.

--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1204,8 +1204,8 @@ static irqreturn_t aer_irq(int irq, void *context)
         if (!(e_src.status & AER_ERR_STATUS_MASK))
                 return IRQ_NONE;

-       pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
         pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, 
e_src.status);
+       pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id)

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
@ 2022-03-18 20:28           ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 12+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2022-03-18 20:28 UTC (permalink / raw)
  To: Bjorn Helgaas, Eric Badger
  Cc: Kuppuswamy Sathyanarayanan, Raj, Ashok, linux-pci, linux-kernel,
	Oliver OHalloran, Bjorn Helgaas, linuxppc-dev



On 3/17/22 3:59 PM, Bjorn Helgaas wrote:
> Thanks for the additional details!
> 
> After this patch, I guess aer_irq() still reads 0x2
> (PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
> PCI_ERR_ROOT_MULTI_COR_RCV.
> 
> In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
> which probably contains either 0 or junk left over from being captured
> when PCI_ERR_ROOT_COR_RCV was set.
> 
> And aer_irq() will queue an e_src record with status ==
> PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
> in status, aer_isr_one_error() will do nothing, right?
> 
> That might not be*terrible*  and is definitely better than not being
> able to handle future interrupts.  But we basically threw away the
> information that multiple errors occurred, and we queued an e_src
> record that occupies space without being used for anything.

Yes, you are correct.  One other way to minimize this race window is to
clear the Root status register as soon as possible. Maybe we can move
it before source ID read as below.

--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1204,8 +1204,8 @@ static irqreturn_t aer_irq(int irq, void *context)
         if (!(e_src.status & AER_ERR_STATUS_MASK))
                 return IRQ_NONE;

-       pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
         pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, 
e_src.status);
+       pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id)

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-03-18 20:30 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-11  2:58 [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly Kuppuswamy Sathyanarayanan
2022-03-11  2:58 ` Kuppuswamy Sathyanarayanan
2022-03-13 19:52 ` Bjorn Helgaas
2022-03-13 19:52   ` Bjorn Helgaas
2022-03-13 21:43   ` Raj, Ashok
2022-03-13 21:43     ` Raj, Ashok
2022-03-14 16:21     ` Eric Badger
2022-03-14 16:21       ` Eric Badger
2022-03-17 22:59       ` Bjorn Helgaas
2022-03-17 22:59         ` Bjorn Helgaas
2022-03-18 20:28         ` Sathyanarayanan Kuppuswamy
2022-03-18 20:28           ` Sathyanarayanan Kuppuswamy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.