linux-pci.vger.kernel.org archive mirror
From: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
To: linux-pci@vger.kernel.org
Cc: chao.p.peng@linux.intel.com, chao.p.peng@intel.com,
	erwin.tsaur@intel.com, feiting.wanyan@intel.com,
	qingshun.wang@intel.com,
	"Wang, Qingshun" <qingshun.wang@linux.intel.com>
Subject: [PATCH 0/4] pci/aer: Handle Advisory Non-Fatal properly
Date: Thu, 11 Jan 2024 15:32:15 +0800
Message-ID: <20240111073227.31488-1-qingshun.wang@linux.intel.com>

According to sections 6.2.3.2.4 and 6.2.4.3 of the PCIe 4.0
specification, certain uncorrectable errors are signaled with ERR_COR
instead of ERR_NONFATAL. Such an error is logged as an Advisory
Non-Fatal Error (ANFE) and sets bits in both the Correctable Error (CE)
Status register and the Uncorrectable Error (UE) Status register.
Currently, when handling an AER event, the kernel looks at either the
CE status or the UE status, but never both. In the ANFE case, the bits
set in the UE status register are therefore not reported or cleared
until the next Fatal/Non-Fatal error arrives.
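
As a rough illustration of the condition described above (a sketch, not
code from this series): an ANFE can be recognized by the Advisory
Non-Fatal bit in CE status, and the UE bits that may have triggered it
are those that are set, unmasked, and configured as non-fatal. The
helper name anfe_candidates() is hypothetical, and filtering on mask
and severity is an assumption based on the spec sections cited above.

```c
#include <assert.h>
#include <stdint.h>

/* Advisory Non-Fatal bit in the Correctable Error Status register
 * (PCI_ERR_COR_ADV_NFAT in include/uapi/linux/pci_regs.h). */
#define COR_ADV_NFAT 0x00002000u

static_assert(COR_ADV_NFAT == (1u << 13),
	      "Advisory Non-Fatal is CE status bit 13");

/*
 * Hypothetical helper: return the UE status bits that could have been
 * signaled as an Advisory Non-Fatal Error.
 * Assumption: a candidate bit is set in UE status, is not masked, and
 * is configured as non-fatal (its severity bit is 0).
 */
static uint32_t anfe_candidates(uint32_t cor_status, uint32_t uncor_status,
				uint32_t uncor_mask, uint32_t uncor_severity)
{
	/* Without the Advisory Non-Fatal bit this is a plain CE. */
	if (!(cor_status & COR_ADV_NFAT))
		return 0;

	return uncor_status & ~uncor_mask & ~uncor_severity;
}
```

With CE status 00002000 and only bit 12 (Poisoned TLP) set in UE
status, this returns 00001000 as long as bit 12 is unmasked and
configured as non-fatal.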

For instance, before this patch series, once the kernel receives an
ANFE caused by a Poisoned TLP in OS native AER mode, only the CE status
is reported and cleared:

[  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
[  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
[  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           

If the kernel receives a Malformed TLP after that, two UEs are
reported, which is unexpected. The Malformed TLP header is lost as
well, because the uncleared ANFE gated the TLP header log:

[  170.540192] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[  170.552420] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00041000/00180020
[  170.561904] pcieport 0000:b7:02.0:    [12] TLP                    (First)
[  170.569656] pcieport 0000:b7:02.0:    [18] MalfTLP       
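
The status word 00041000 above has two bits set, which is why two UE
lines are printed: bit 12 is left over from the earlier ANFE and bit 18
is the new Malformed TLP. A minimal sketch of that decoding follows; it
is not kernel code, and aer_status_bits() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store the indices of the set bits of an AER
 * status word in out[] (room for 32 entries), lowest bit first, and
 * return how many bits were found.
 */
static size_t aer_status_bits(uint32_t status, unsigned int out[32])
{
	size_t n = 0;

	assert(out != NULL);
	for (unsigned int bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			out[n++] = bit;

	return n;
}
```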

To handle this case properly, this patch set adds fields to
aer_err_info that track both CE and UE status/mask as well as UE
severity. This information is later used to determine which register
bits need to be cleared. The series also adds more data to the
aer_event tracepoint, which helps external observers better understand
ANFE and other errors.
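
One way to picture how such a snapshot could drive the clearing
decision is sketched below. This is an illustration under assumptions,
not the code in the patches: struct aer_snapshot, its field names, enum
aer_sev, and uncor_bits_to_clear() are all hypothetical, and the
per-severity rules are a simplified reading of the behavior described
in this cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COR_ADV_NFAT 0x00002000u	/* Advisory Non-Fatal, CE status bit 13 */

/* Hypothetical register snapshot; the field names are illustrative and
 * are not claimed to match what the patches add to aer_err_info. */
struct aer_snapshot {
	uint32_t cor_status, cor_mask;
	uint32_t uncor_status, uncor_mask, uncor_severity;
};

enum aer_sev { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/* UE status bits to write back (write-1-to-clear) for one AER event. */
static uint32_t uncor_bits_to_clear(const struct aer_snapshot *s,
				    enum aer_sev sev)
{
	assert(s != NULL);

	switch (sev) {
	case SEV_FATAL:
		/* Bits whose severity is configured as fatal. */
		return s->uncor_status & s->uncor_severity;
	case SEV_NONFATAL:
		/* Bits whose severity is configured as non-fatal. */
		return s->uncor_status & ~s->uncor_severity;
	case SEV_CORRECTABLE:
		/* Only an ANFE has UE bits attached to a CE event. */
		if (s->cor_status & COR_ADV_NFAT)
			return s->uncor_status & ~s->uncor_mask &
			       ~s->uncor_severity;
		return 0;
	}

	return 0;
}
```

The point of keeping all five register values together is that the
correctable path can then clear the UE bits belonging to an ANFE
instead of leaving them behind for the next Fatal/Non-Fatal error.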

In the previous scenario, after this patch series, both the CE status
and the related UE status are reported and cleared after an ANFE:

[  148.459816] pcieport 0000:b7:02.0: AER: Corrected error received: 0000:b7:02.0
[  148.459858] pcieport 0000:b7:02.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
[  148.459863] pcieport 0000:b7:02.0:   device [8086:0db0] error status/mask=00002000/00000000
[  148.459868] pcieport 0000:b7:02.0:    [13] NonFatalErr           
[  148.459868] pcieport 0000:b7:02.0:   Uncorrectable errors that may cause Advisory Non-Fatal:
[  148.459868] pcieport 0000:b7:02.0:    [12] TLP

Wang, Qingshun (4):
  pci/aer: Store more information in aer_err_info
  pci/aer: Handle Advisory Non-Fatal properly
  pci/aer: Fetch information for FTrace
  ras: Trace more information in aer_event

 drivers/acpi/apei/ghes.c      |  16 ++-
 drivers/cxl/core/pci.c        |  15 ++-
 drivers/pci/pci.h             |  12 ++-
 drivers/pci/pcie/aer.c        | 188 +++++++++++++++++++++++++++-------
 include/linux/aer.h           |   6 +-
 include/linux/pci.h           |  27 +++++
 include/ras/ras_event.h       |  48 ++++++++-
 include/uapi/linux/pci_regs.h |   1 +
 8 files changed, 266 insertions(+), 47 deletions(-)

base-commit: 610a9b8f49fbcf1100716370d3b5f6f884a2835a
-- 
2.42.0


Thread overview: 14+ messages
2024-01-11  7:32 Wang, Qingshun [this message]
2024-01-11  7:32 ` [PATCH 1/4] pci/aer: Store more information in aer_err_info Wang, Qingshun
2024-01-11 11:27   ` Ilpo Järvinen
2024-01-12  3:21     ` Wang, Qingshun
2024-01-12 16:32   ` Bjorn Helgaas
2024-01-16  8:35     ` Wang, Qingshun
2024-01-11  7:32 ` [PATCH 2/4] pci/aer: Handle Advisory Non-Fatal properly Wang, Qingshun
2024-01-12 16:35   ` Bjorn Helgaas
2024-01-16  8:42     ` Wang, Qingshun
2024-01-11  7:32 ` [PATCH 3/4] pci/aer: Fetch information for FTrace Wang, Qingshun
2024-01-11  7:32 ` [PATCH 4/4] ras: Trace more information in aer_event Wang, Qingshun
2024-01-12  3:23 ` [PATCH 0/4] pci/aer: Handle Advisory Non-Fatal properly Wang, Qingshun
2024-01-12 16:41 ` Bjorn Helgaas
2024-01-16  8:32   ` Wang, Qingshun
