Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors

From: Bjorn Helgaas <helgaas@kernel.org>
To: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
	"Neftin, Sasha" <sasha.neftin@intel.com>,
	Leon Romanovsky <leon@kernel.org>,
	linux-pci@vger.kernel.org,
	Frederick Zhang <frederick888@tsundere.moe>,
	rajat.khandelwal@intel.com, linux-kernel@vger.kernel.org,
	oohall@gmail.com, bhelgaas@google.com,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
Date: Tue, 3 Jan 2023 13:14:18 -0600
Message-ID: <20230103191418.GA1011392@bhelgaas>
In-Reply-To: <20230103165548.570377-1-rajat.khandelwal@linux.intel.com>

[+cc Paul, Sasha, Leon, Frederick]

(Please cc folks who have commented on previous versions of your
patch.)

On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> There are many instances where correctable errors tend to inundate
> the message buffer. We observe such instances during Thunderbolt PCIe
> tunneling.
> 
> It's true that they are mitigated by the hardware and are non-fatal
> but we shouldn't be spamming the logs with such correctable errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs, hence rate limit them.

I want a better understanding of why we have so many errors before
rate-limiting everybody.

> A typical example log inside an HP TBT4 dock:
> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54912.661203] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001100/00002000
> [54912.661211] igc 0000:2b:00.0:    [ 8] Rollover
> [54912.661219] igc 0000:2b:00.0:    [12] Timeout
> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54982.838808] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001000/00002000
> [54982.838817] igc 0000:2b:00.0:    [12] Timeout

Please remove the timestamps; they don't contribute to understanding
the problem.

> This gets repeated continuously, thus inundating the buffer.

Did you verify that we actually clear the Correctable Error Status
register?
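
For reference, here is a rough userspace sketch (not kernel code) of how
the status/mask values in the log above decode, and of the
write-1-to-clear behavior the question is about.  The "Rollover" and
"Timeout" names come from the log itself; the other bit names and the
helper names (decode_correctable, write_one_to_clear) are assumptions
made for illustration and should be checked against the PCIe spec and
drivers/pci/pcie/aer.c.

```python
# Illustrative only: bit positions in the AER Correctable Error Status
# register. "Rollover" (bit 8) and "Timeout" (bit 12) match the quoted
# log; the remaining names are assumed and should be verified.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(status, mask=0):
    """Return (bit, name, masked) for every bit set in status."""
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = AER_CORRECTABLE_BITS.get(bit, "Unknown")
            masked = bool(mask & (1 << bit))
            decoded.append((bit, name, masked))
    return decoded


def write_one_to_clear(status, written):
    """Model an RW1C register: each 1 bit written clears that status bit."""
    return status & ~written & 0xFFFFFFFF


if __name__ == "__main__":
    # status/mask=00001100/00002000 from the first report in the log
    for bit, name, masked in decode_correctable(0x00001100, 0x00002000):
        print(f"[{bit:2d}] {name}{' (masked)' if masked else ''}")
```

If the status bits are written back after logging, the modeled register
reads zero afterwards; a bit that stays set, or is immediately set
again, would be consistent with the repeated reports.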

https://bugzilla.kernel.org/show_bug.cgi?id=216863 looks like a
similar issue.  The issue Frederick is seeing happens when resuming
from sleep.  Is there some event that triggers the correctable errors
you see?

Bjorn