From: Bjorn Helgaas <helgaas@kernel.org>
To: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
    "Neftin, Sasha" <sasha.neftin@intel.com>,
    Leon Romanovsky <leon@kernel.org>,
    linux-pci@vger.kernel.org,
    Frederick Zhang <frederick888@tsundere.moe>,
    rajat.khandelwal@intel.com,
    linux-kernel@vger.kernel.org,
    oohall@gmail.com,
    bhelgaas@google.com,
    linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
Date: Tue, 3 Jan 2023 13:14:18 -0600
Message-ID: <20230103191418.GA1011392@bhelgaas>
In-Reply-To: <20230103165548.570377-1-rajat.khandelwal@linux.intel.com>

[+cc Paul, Sasha, Leon, Frederick]

(Please cc folks who have commented on previous versions of your
patch.)

On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> There are many instances where correctable errors tend to inundate
> the message buffer. We observe such instances during thunderbolt PCIe
> tunneling.
>
> It's true that they are mitigated by the hardware and are non-fatal
> but we shouldn't be spamming the logs with such correctable errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs, hence rate limit them.

I want a better understanding of why we have so many errors before
rate-limiting everybody.

> A typical example log inside an HP TBT4 dock:
> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000
> [54912.661211] igc 0000:2b:00.0: [ 8] Rollover
> [54912.661219] igc 0000:2b:00.0: [12] Timeout
> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
> [54982.838817] igc 0000:2b:00.0: [12] Timeout

Please remove the timestamps; they don't contribute to understanding
the problem.

> This gets repeated continuously, thus inundating the buffer.

Did you verify that we actually clear the Correctable Error Status
register?

https://bugzilla.kernel.org/show_bug.cgi?id=216863 looks like a
similar issue. The issue Frederick is seeing happens when resuming
from sleep. Is there some event that triggers the correctable errors
you see?

Bjorn
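
For readers following the quoted log, the status/mask pairs can be
decoded by hand. The snippet below is a minimal userspace sketch, not
kernel code and not part of the patch under review. The helper name
decode_correctable and the table CORRECTABLE_BITS are hypothetical; the
bit names are the long forms from the PCIe AER Correctable Error
Status register layout, which the kernel log abbreviates to "Rollover"
and "Timeout".

```python
# Hypothetical helper for reading AER "error status/mask=..." log fields.
# Bit positions follow the PCIe AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(status_hex, mask_hex):
    """Return (reported, masked) lists of (bit, name) tuples.

    status_hex and mask_hex are the two hex strings printed in the
    'error status/mask=' field of an AER log line.
    """
    status = int(status_hex, 16)
    mask = int(mask_hex, 16)
    known = sorted(CORRECTABLE_BITS.items())
    # Bits set in the status register: errors the device has logged.
    reported = [(bit, name) for bit, name in known if status & (1 << bit)]
    # Bits set in the mask register: errors the device will not signal.
    masked = [(bit, name) for bit, name in known if mask & (1 << bit)]
    return reported, masked


if __name__ == "__main__":
    # Values taken from the first quoted log entry above.
    reported, masked = decode_correctable("00001100", "00002000")
    for bit, name in reported:
        print(f"[{bit:2d}] {name}")
    for bit, name in masked:
        print(f"masked [{bit:2d}] {name}")
```

Both quoted entries report only replay-related Data Link Layer bits
(8 and 12), and the mask covers only bit 13. This does not answer
whether the status register is actually being cleared between reports.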