All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zhenzhong Duan <zhenzhong.duan@intel.com>
To: linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	linux-acpi@vger.kernel.org
Cc: rafael@kernel.org, lenb@kernel.org, james.morse@arm.com,
	tony.luck@intel.com, bp@alien8.de, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, vishal.l.verma@intel.com,
	ira.weiny@intel.com, bhelgaas@google.com, helgaas@kernel.org,
	mahesh@linux.ibm.com, oohall@gmail.com, linmiaohe@huawei.com,
	shiju.jose@huawei.com, adam.c.preble@intel.com,
	leoyang.li@nxp.com, lukas@wunner.de,
	Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
	linux-cxl@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, erwin.tsaur@intel.com,
	sathyanarayanan.kuppuswamy@intel.com, dan.j.williams@intel.com,
	feiting.wanyan@intel.com, yudong.wang@intel.com,
	chao.p.peng@intel.com, qingshun.wang@linux.intel.com,
	Zhenzhong Duan <zhenzhong.duan@intel.com>
Subject: [PATCH v3 0/3] PCI/AER: Handle Advisory Non-Fatal error
Date: Wed, 17 Apr 2024 14:14:04 +0800	[thread overview]
Message-ID: <20240417061407.1491361-1-zhenzhong.duan@intel.com> (raw)

Hi,

This is a relay work of Qingshun's v2 [1], but changed to focus on ANFE
processing as subject suggests and drops trace-event for now. I think it's
a bit heavy to do extra IOes to get PCIe registers only for trace purpose
and not see it a community request for now.

According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
6.2.4.3, certain uncorrectable errors will signal ERR_COR instead of
ERR_NONFATAL, logged as Advisory Non-Fatal Error(ANFE), and set bits in
both Correctable Error(CE) Status register and Uncorrectable Error(UE)
Status register. Currently, when handling AER events the kernel will only
look at CE status or UE status, but never both. In the ANFE case, bits set
in the UE status register will not be reported and cleared until the next
FE/NFE arrives.

For instance, previously, when the kernel receives an ANFE with Poisoned
TLP in OS native AER mode, only the status of CE will be reported and
cleared:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr

If the kernel receives a Malformed TLP after that, two UEs will be
reported, which is unexpected. The Malformed TLP Header is lost since
the previous ANFE gated the TLP header logs:

  PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP                    (First)
     [18] MalfTLP

To handle this case properly, calculate potential ANFE related status bits
and save in aer_err_info. Use this information to determine the status bits
that need to be cleared.

Now, for the previous scenario, both CE status and related UE status will
be reported and cleared after ANFE:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [18] TLP

Note:
checkpatch.pl will produce following warnings on PATCH2/3:

WARNING: 'UE' may be misspelled - perhaps 'USE'?
#22:
uncorrectable error(UE) status should be cleared. However, there is no

...similar warnings omitted...

This is a false-positive, so not fixed.

WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#10:
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)

...similar warnings omitted...

For readability reasons, these warnings are not fixed.



[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com

Thanks
Qingshun, Zhenzhong

Changelog:
v3:
  - Split ANFE print and processing to two patches (Bjorn)
  - Simplify ANFE handling, drop trace event
  - Polish comments and patch description
  - Add Tested-by

v2:
  - Reference to the latest PCIe Specification in both commit messages
    and comments, as suggested by Bjorn Helgaas.
  - Describe the reason for storing additional information in
    aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
    Helgaas.
  - Add more details of behavior changes in the commit message of PATCH
    2, as suggested by Bjorn Helgaas.

v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@linux.intel.com/


Zhenzhong Duan (3):
  PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
  PCI/AER: Print UNCOR_STATUS bits that might be ANFE
  PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

 drivers/pci/pci.h      |  1 +
 drivers/pci/pcie/aer.c | 67 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 67 insertions(+), 1 deletion(-)

-- 
2.34.1


WARNING: multiple messages have this Message-ID (diff)
From: Zhenzhong Duan <zhenzhong.duan@intel.com>
To: linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	linux-acpi@vger.kernel.org
Cc: linmiaohe@huawei.com, alison.schofield@intel.com,
	rafael@kernel.org, sathyanarayanan.kuppuswamy@intel.com,
	erwin.tsaur@intel.com, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, oohall@gmail.com,
	ira.weiny@intel.com, dave@stgolabs.net, dave.jiang@intel.com,
	vishal.l.verma@intel.com, Smita.KoralahalliChannabasappa@amd.com,
	helgaas@kernel.org, lenb@kernel.org, chao.p.peng@intel.com,
	rrichter@amd.com, yudong.wang@intel.com, bp@alien8.de,
	qingshun.wang@linux.intel.com, jonathan.cameron@huawei.com,
	bhelgaas@google.com, dan.j.williams@intel.com,
	Zhenzhong Duan <zhenzhong.duan@intel.com>,
	linux-edac@vger.kernel.org, tony.luck@intel.com,
	feiting.wanyan@intel.com, adam.c.preble@intel.com,
	mahesh@linux.ibm.com, leoyang.li@nxp.com, lukas@wunner.de,
	james.morse@arm.com, shiju.jose@huawei.com
Subject: [PATCH v3 0/3] PCI/AER: Handle Advisory Non-Fatal error
Date: Wed, 17 Apr 2024 14:14:04 +0800	[thread overview]
Message-ID: <20240417061407.1491361-1-zhenzhong.duan@intel.com> (raw)

Hi,

This is a relay work of Qingshun's v2 [1], but changed to focus on ANFE
processing as subject suggests and drops trace-event for now. I think it's
a bit heavy to do extra IOes to get PCIe registers only for trace purpose
and not see it a community request for now.

According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
6.2.4.3, certain uncorrectable errors will signal ERR_COR instead of
ERR_NONFATAL, logged as Advisory Non-Fatal Error(ANFE), and set bits in
both Correctable Error(CE) Status register and Uncorrectable Error(UE)
Status register. Currently, when handling AER events the kernel will only
look at CE status or UE status, but never both. In the ANFE case, bits set
in the UE status register will not be reported and cleared until the next
FE/NFE arrives.

For instance, previously, when the kernel receives an ANFE with Poisoned
TLP in OS native AER mode, only the status of CE will be reported and
cleared:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr

If the kernel receives a Malformed TLP after that, two UEs will be
reported, which is unexpected. The Malformed TLP Header is lost since
the previous ANFE gated the TLP header logs:

  PCIe Bus Error: severity="Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP                    (First)
     [18] MalfTLP

To handle this case properly, calculate potential ANFE related status bits
and save in aer_err_info. Use this information to determine the status bits
that need to be cleared.

Now, for the previous scenario, both CE status and related UE status will
be reported and cleared after ANFE:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [18] TLP

Note:
checkpatch.pl will produce following warnings on PATCH2/3:

WARNING: 'UE' may be misspelled - perhaps 'USE'?
#22:
uncorrectable error(UE) status should be cleared. However, there is no

...similar warnings omitted...

This is a false-positive, so not fixed.

WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#10:
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)

...similar warnings omitted...

For readability reasons, these warnings are not fixed.



[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com

Thanks
Qingshun, Zhenzhong

Changelog:
v3:
  - Split ANFE print and processing to two patches (Bjorn)
  - Simplify ANFE handling, drop trace event
  - Polish comments and patch description
  - Add Tested-by

v2:
  - Reference to the latest PCIe Specification in both commit messages
    and comments, as suggested by Bjorn Helgaas.
  - Describe the reason for storing additional information in
    aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
    Helgaas.
  - Add more details of behavior changes in the commit message of PATCH
    2, as suggested by Bjorn Helgaas.

v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@linux.intel.com/


Zhenzhong Duan (3):
  PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
  PCI/AER: Print UNCOR_STATUS bits that might be ANFE
  PCI/AER: Clear UNCOR_STATUS bits that might be ANFE

 drivers/pci/pci.h      |  1 +
 drivers/pci/pcie/aer.c | 67 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 67 insertions(+), 1 deletion(-)

-- 
2.34.1


             reply	other threads:[~2024-04-17  6:16 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-17  6:14 Zhenzhong Duan [this message]
2024-04-17  6:14 ` [PATCH v3 0/3] PCI/AER: Handle Advisory Non-Fatal error Zhenzhong Duan
2024-04-17  6:14 ` [PATCH v3 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info Zhenzhong Duan
2024-04-17  6:14   ` Zhenzhong Duan
2024-04-22 16:16   ` Jonathan Cameron
2024-04-22 16:16     ` Jonathan Cameron
2024-04-23  2:25     ` Duan, Zhenzhong
2024-04-23  2:25       ` Duan, Zhenzhong
2024-04-26 16:11       ` Jonathan Cameron
2024-04-26 16:11         ` Jonathan Cameron
2024-04-28  3:31         ` Duan, Zhenzhong
2024-04-28  3:31           ` Duan, Zhenzhong
2024-05-01 12:24           ` Jonathan Cameron
2024-05-01 12:24             ` Jonathan Cameron
2024-05-03 11:10             ` Duan, Zhenzhong
2024-05-03 11:10               ` Duan, Zhenzhong
2024-04-17  6:14 ` [PATCH v3 2/3] PCI/AER: Print UNCOR_STATUS bits that might be ANFE Zhenzhong Duan
2024-04-17  6:14   ` Zhenzhong Duan
2024-04-17  6:14 ` [PATCH v3 3/3] PCI/AER: Clear " Zhenzhong Duan
2024-04-17  6:14   ` Zhenzhong Duan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240417061407.1491361-1-zhenzhong.duan@intel.com \
    --to=zhenzhong.duan@intel.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=adam.c.preble@intel.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=bp@alien8.de \
    --cc=chao.p.peng@intel.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=erwin.tsaur@intel.com \
    --cc=feiting.wanyan@intel.com \
    --cc=helgaas@kernel.org \
    --cc=ira.weiny@intel.com \
    --cc=james.morse@arm.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=lenb@kernel.org \
    --cc=leoyang.li@nxp.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=oohall@gmail.com \
    --cc=qingshun.wang@linux.intel.com \
    --cc=rafael@kernel.org \
    --cc=rrichter@amd.com \
    --cc=sathyanarayanan.kuppuswamy@intel.com \
    --cc=shiju.jose@huawei.com \
    --cc=tony.luck@intel.com \
    --cc=vishal.l.verma@intel.com \
    --cc=yudong.wang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.