linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Loic Poulain <loic.poulain@linaro.org>
Cc: bhelgaas@google.com, ruscur@russell.cc, linux-pci@vger.kernel.org
Subject: Re: [RFC] pci: aer: Disable corrected error reporting by default
Date: Fri, 4 Dec 2020 12:06:15 -0600	[thread overview]
Message-ID: <20201204180615.GA1754610@bjorn-Precision-5520> (raw)
In-Reply-To: <1607102932-10384-1-git-send-email-loic.poulain@linaro.org>

On Fri, Dec 04, 2020 at 06:28:52PM +0100, Loic Poulain wrote:
> It appears to be very common that people complain about kernel log
> (and irq) flooding because of reported corrected errors by AER.

Do you have any pointers to these handy?

I'm pretty sure there's an issue in our AER handling where we don't
clear the error correctly, so we keep complaining about the same error
again and again.

My preferences for how to handle this situation:

  1) Fix our AER handling (if it's indeed broken; I could be wrong)
  2) Automatically rate-limit corrected error reporting
  3) As a last resort, some parameter like this patch

Reports like https://bugzilla.kernel.org/show_bug.cgi?id=111601
have been around for an embarrassingly long time, but I guess nobody
has had time to really dig into them.

My theory about our AER problem is here:
https://lore.kernel.org/linux-pci/20160215171423.GA12641@localhost/

> An usual reply/solution is to completely disable aer with 'noaer' pci
> parameter. This is a big hammer tip since it also prevents reporting of
> 'real' non corrected PCI errors, that need to be handled by the kernel.
> 
> A PCI correctable error is an error corrected at hardware level by the
> PCI protocol (e.g. with retry mechanism), the OS can then totally live
> without being notified about that hardware event.
> 
> A simple change would then consist in not enabling correctable error
> reporting at all, but it can remain useful in some cases, such as for
> determining health of the PCI link.
> 
> This patch changes the default AER mask to not enable correctable error
> reporting by default, and introduce a new pci parameter, 'aerfull' that
> can be used to re-enable all error reports, including correctable ones.
> 
> Note: Alternatively, if changing the legacy behavior is not desirable,
> that can be done the other way, with a 'noaer_correctable' parameter to
> only disable correctable error reporting.
> 
> Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> ---
>  drivers/pci/pci.c      |  2 ++
>  drivers/pci/pci.h      |  2 ++
>  drivers/pci/pcie/aer.c | 12 +++++++++++-
>  3 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 6d4d5a2..c67ec709 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -6498,6 +6498,8 @@ static int __init pci_setup(char *str)
>  				pcie_ats_disabled = true;
>  			} else if (!strcmp(str, "noaer")) {
>  				pci_no_aer();
> +			} else if (!strcmp(str, "aerfull")) {
> +				pci_aer_full();
>  			} else if (!strcmp(str, "earlydump")) {
>  				pci_early_dump = true;
>  			} else if (!strncmp(str, "realloc=", 8)) {
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index f86cae9..36306a1 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -662,6 +662,7 @@ static inline int devm_of_pci_bridge_init(struct device *dev, struct pci_host_br
>  
>  #ifdef CONFIG_PCIEAER
>  void pci_no_aer(void);
> +void pci_aer_full(void);
>  void pci_aer_init(struct pci_dev *dev);
>  void pci_aer_exit(struct pci_dev *dev);
>  extern const struct attribute_group aer_stats_attr_group;
> @@ -670,6 +671,7 @@ int pci_aer_clear_status(struct pci_dev *dev);
>  int pci_aer_raw_clear_status(struct pci_dev *dev);
>  #else
>  static inline void pci_no_aer(void) { }
> +static inline void pci_aer_full(void) { }
>  static inline void pci_aer_init(struct pci_dev *d) { }
>  static inline void pci_aer_exit(struct pci_dev *d) { }
>  static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 65dff5f..e0ec7047 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -102,6 +102,7 @@ struct aer_stats {
>  #define ERR_UNCOR_ID(d)			(d >> 16)
>  
>  static int pcie_aer_disable;
> +static int pcie_aer_full;
>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>  
>  void pci_no_aer(void)
> @@ -109,6 +110,11 @@ void pci_no_aer(void)
>  	pcie_aer_disable = 1;
>  }
>  
> +void pci_aer_full(void)
> +{
> +	pcie_aer_full = 1;
> +}
> +
>  bool pci_aer_available(void)
>  {
>  	return !pcie_aer_disable && pci_msi_enabled();
> @@ -224,12 +230,16 @@ int pcie_aer_is_native(struct pci_dev *dev)
>  
>  int pci_enable_pcie_error_reporting(struct pci_dev *dev)
>  {
> +	u16 flags = PCI_EXP_AER_FLAGS;
>  	int rc;
>  
>  	if (!pcie_aer_is_native(dev))
>  		return -EIO;
>  
> -	rc = pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
> +	if (!pcie_aer_full)
> +		flags &= ~PCI_EXP_DEVCTL_CERE;
> +
> +	rc = pcie_capability_set_word(dev, PCI_EXP_DEVCTL, flags);
>  	return pcibios_err_to_errno(rc);
>  }
>  EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
> -- 
> 2.7.4
> 

      reply	other threads:[~2020-12-04 18:06 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-04 17:28 [RFC] pci: aer: Disable corrected error reporting by default Loic Poulain
2020-12-04 18:06 ` Bjorn Helgaas [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201204180615.GA1754610@bjorn-Precision-5520 \
    --to=helgaas@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=loic.poulain@linaro.org \
    --cc=ruscur@russell.cc \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).