From: Bjorn Helgaas <helgaas@kernel.org>
To: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
	"Neftin, Sasha" <sasha.neftin@intel.com>,
	Leon Romanovsky <leon@kernel.org>,
	linux-pci@vger.kernel.org,
	Frederick Zhang <frederick888@tsundere.moe>,
	rajat.khandelwal@intel.com, linux-kernel@vger.kernel.org,
	oohall@gmail.com, bhelgaas@google.com,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
Date: Tue, 3 Jan 2023 13:14:18 -0600
Message-ID: <20230103191418.GA1011392@bhelgaas>
In-Reply-To: <20230103165548.570377-1-rajat.khandelwal@linux.intel.com>

[+cc Paul, Sasha, Leon, Frederick]

(Please cc folks who have commented on previous versions of your
patch.)

On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> There are many instances where correctable errors tend to inundate
> the message buffer. We observe such instances during thunderbolt PCIe
> tunneling.
> 
> It's true that they are mitigated by the hardware and are non-fatal
> but we shouldn't be spamming the logs with such correctable errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs, hence rate limit them.

I want a better understanding of why we have so many errors before
rate-limiting everybody.

> A typical example log inside an HP TBT4 dock:
> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54912.661203] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001100/00002000
> [54912.661211] igc 0000:2b:00.0:    [ 8] Rollover
> [54912.661219] igc 0000:2b:00.0:    [12] Timeout
> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54982.838808] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001000/00002000
> [54982.838817] igc 0000:2b:00.0:    [12] Timeout

Please remove the timestamps; they don't contribute to understanding
the problem.

> This gets repeated continuously, thus inundating the buffer.

Did you verify that we actually clear the Correctable Error Status
register?
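
For reference when reading the quoted log, here is a minimal sketch that
decodes the "error status/mask" pair. It assumes the bit layout of the
AER Correctable Error Status and Mask registers from the PCIe Base
Specification (the same positions as the PCI_ERR_COR_* definitions in
Linux's pci_regs.h). The function names are made up for illustration, and
the bit names are the spec's long forms rather than the short strings the
kernel prints.

```python
# Bit positions in the AER Correctable Error Status/Mask registers.
# Assumption: layout per the PCIe Base Specification; names are the
# spec's long forms, not the kernel's short log strings.
COR_ERR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_cor_bits(value):
    """Return (bit, name) pairs for every bit set in a 32-bit value."""
    return [
        (bit, COR_ERR_BITS.get(bit, "Reserved/unknown"))
        for bit in range(32)
        if value & (1 << bit)
    ]


def parse_status_mask(field):
    """Parse the 'status/mask' hex pair as printed in the AER log line."""
    status, mask = (int(part, 16) for part in field.split("/"))
    return status, mask


# Values taken from the first report in the quoted log.
status, mask = parse_status_mask("00001100/00002000")
reported = decode_cor_bits(status)  # bits 8 and 12, as in the log
masked = decode_cor_bits(mask)      # bit 13 only
```

The status bits are write-1-to-clear, so if the same bit shows up in
every report even after being cleared, the device is presumably logging
new errors rather than leaving a stale one behind.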

https://bugzilla.kernel.org/show_bug.cgi?id=216863 looks like a
similar issue.  The issue Frederick is seeing happens when resuming
from sleep.  Is there some event that triggers the correctable errors
you see?

Bjorn
