* [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
@ 2023-01-03 16:55 ` Rajat Khandelwal
0 siblings, 0 replies; 8+ messages in thread
From: Rajat Khandelwal @ 2023-01-03 16:55 UTC (permalink / raw)
To: ruscur, oohall, bhelgaas
Cc: linuxppc-dev, linux-pci, linux-kernel, rajat.khandelwal,
Rajat Khandelwal
There are many instances where correctable errors inundate the kernel
message buffer. We observe such instances during Thunderbolt PCIe
tunneling.
These errors are corrected by the hardware and are non-fatal, but we
shouldn't spam the logs with them: the messages confuse kernel
developers less familiar with PCI errors, support staff, and users who
happen to look at the logs. Hence, rate limit them.
A typical example log inside an HP TBT4 dock:
[54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
[54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000
[54912.661211] igc 0000:2b:00.0: [ 8] Rollover
[54912.661219] igc 0000:2b:00.0: [12] Timeout
[54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
[54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
[54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
[54982.838817] igc 0000:2b:00.0: [12] Timeout
This gets repeated continuously, thus inundating the buffer.
Signed-off-by: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
---
drivers/pci/pcie/aer.c | 54 +++++++++++++++++++++++++++---------------
include/linux/pci.h | 3 +++
2 files changed, 38 insertions(+), 19 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e2d8a74f83c3..7ae6761a8e59 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -684,23 +684,24 @@ static void __aer_print_error(struct pci_dev *dev,
{
const char **strings;
unsigned long status = info->status & ~info->mask;
- const char *level, *errmsg;
+ const char *errmsg;
int i;
- if (info->severity == AER_CORRECTABLE) {
+ if (info->severity == AER_CORRECTABLE)
strings = aer_correctable_error_string;
- level = KERN_WARNING;
- } else {
+ else
strings = aer_uncorrectable_error_string;
- level = KERN_ERR;
- }
for_each_set_bit(i, &status, 32) {
errmsg = strings[i];
if (!errmsg)
errmsg = "Unknown Error Bit";
- pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg,
+ if (info->severity == AER_CORRECTABLE)
+ pci_warn_ratelimited(dev, " [%2d] %-22s%s\n", i, errmsg,
+ info->first_error == i ? " (First)" : "");
+ else
+ pci_err(dev, " [%2d] %-22s%s\n", i, errmsg,
info->first_error == i ? " (First)" : "");
}
pci_dev_aer_stats_incr(dev, info);
@@ -710,7 +711,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
{
int layer, agent;
int id = ((dev->bus->number << 8) | dev->devfn);
- const char *level;
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
@@ -721,14 +718,21 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
- level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
+ if (info->severity == AER_CORRECTABLE) {
+ pci_warn_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
+ aer_error_severity_string[info->severity],
+ aer_error_layer[layer], aer_agent_string[agent]);
- pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
- aer_error_severity_string[info->severity],
- aer_error_layer[layer], aer_agent_string[agent]);
+ pci_warn_ratelimited(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
+ dev->vendor, dev->device, info->status, info->mask);
+ } else {
+ pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
+ aer_error_severity_string[info->severity],
+ aer_error_layer[layer], aer_agent_string[agent]);
- pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
- dev->vendor, dev->device, info->status, info->mask);
+ pci_err(dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
+ dev->vendor, dev->device, info->status, info->mask);
+ }
__aer_print_error(dev, info);
@@ -748,11 +755,19 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
u8 bus = info->id >> 8;
u8 devfn = info->id & 0xff;
- pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
- info->multi_error_valid ? "Multiple " : "",
- aer_error_severity_string[info->severity],
- pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
- PCI_FUNC(devfn));
+ if (info->severity == AER_CORRECTABLE)
+ pci_info_ratelimited(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
+ info->multi_error_valid ? "Multiple " : "",
+ aer_error_severity_string[info->severity],
+ pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
+ PCI_FUNC(devfn));
+ else
+ pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
+ info->multi_error_valid ? "Multiple " : "",
+ aer_error_severity_string[info->severity],
+ pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
+ PCI_FUNC(devfn));
+
}
#ifdef CONFIG_ACPI_APEI_PCIEAER
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 060af91bafcd..d9434bae10c8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2491,6 +2491,9 @@ void pci_uevent_ers(struct pci_dev *pdev, enum pci_ers_result err_type);
#define pci_info_ratelimited(pdev, fmt, arg...) \
dev_info_ratelimited(&(pdev)->dev, fmt, ##arg)
+#define pci_warn_ratelimited(pdev, fmt, arg...) \
+ dev_warn_ratelimited(&(pdev)->dev, fmt, ##arg)
+
#define pci_WARN(pdev, condition, fmt, arg...) \
WARN(condition, "%s %s: " fmt, \
dev_driver_string(&(pdev)->dev), pci_name(pdev), ##arg)
--
2.34.1
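For readers unfamiliar with the `*_ratelimited` helpers the patch switches to, the following is a minimal user-space sketch of an interval/burst limiter. It is a model for illustration, not the kernel implementation: `struct rl_state` and `rl_allow()` are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow. The kernel helpers are assumed here to use a default window of 5 seconds and a burst of 10 messages, with separate state for each call site.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of an interval/burst rate limiter. */
struct rl_state {
	uint64_t interval;	/* window length, in arbitrary time units */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages emitted in this window */
	unsigned int missed;	/* messages suppressed in this window */
	bool started;		/* false until the first call */
};

/* Returns true if the caller may print a message at time 'now'. */
static bool rl_allow(struct rl_state *rs, uint64_t now)
{
	/* Open a new window on first use or once the interval has elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	/* Budget for this window is used up: count and suppress. */
	rs->missed++;
	return false;
}
```

Two observations follow from this model, both under the assumptions stated above. First, the two reports in the sample log are roughly 70 seconds apart, so with a 5 second window and a burst of 10 both would still be printed; the limiter only helps when reports arrive much faster than that. Second, if every call site keeps its own state, the lines of a single report can be throttled independently, so gating a whole report on one `rl_allow()` style decision is one way to avoid partial reports.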
* Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
2023-01-03 16:55 ` Rajat Khandelwal
@ 2023-01-03 19:14 ` Bjorn Helgaas
-1 siblings, 0 replies; 8+ messages in thread
From: Bjorn Helgaas @ 2023-01-03 19:14 UTC (permalink / raw)
To: Rajat Khandelwal
Cc: Paul Menzel, Neftin, Sasha, Leon Romanovsky, linux-pci,
Frederick Zhang, rajat.khandelwal, linux-kernel, oohall,
bhelgaas, linuxppc-dev
[+cc Paul, Sasha, Leon, Frederick]
(Please cc folks who have commented on previous versions of your
patch.)
On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> There are many instances where correctable errors tend to inundate
> the message buffer. We observe such instances during thunderbolt PCIe
> tunneling.
>
> It's true that they are mitigated by the hardware and are non-fatal
> but we shouldn't be spamming the logs with such correctable errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs, hence rate limit them.
I want a better understanding of why we have so many errors before
rate-limiting everybody.
> A typical example log inside an HP TBT4 dock:
> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000
> [54912.661211] igc 0000:2b:00.0: [ 8] Rollover
> [54912.661219] igc 0000:2b:00.0: [12] Timeout
> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
> [54982.838817] igc 0000:2b:00.0: [12] Timeout
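For reference, the `error status/mask` pairs in the quoted log can be decoded by hand: the reported bits are `status & ~mask`, and each set bit maps to a short name. The sketch below is user-space C for illustration only, not kernel code. `decode_cor_status()` and `cor_names` are hypothetical names; bits 8 (`Rollover`) and 12 (`Timeout`) come from the log itself, and the remaining names are assumed to match the kernel's table of correctable error strings.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Short names for Correctable Error Status bits (partly assumed, see text). */
static const char *const cor_names[32] = {
	[0]  = "RxErr",
	[6]  = "BadTLP",
	[7]  = "BadDLLP",
	[8]  = "Rollover",
	[12] = "Timeout",
	[13] = "NonFatalErr",
	[14] = "CorrIntErr",
	[15] = "HeaderOF",
};

/*
 * Formats one "[bit] name" line per unmasked status bit into buf and
 * returns the number of bits written.
 */
static int decode_cor_status(uint32_t status, uint32_t mask,
			     char *buf, size_t len)
{
	uint32_t active = status & ~mask;	/* masked bits are not reported */
	size_t off = 0;
	int count = 0;

	if (len)
		buf[0] = '\0';

	for (int i = 0; i < 32; i++) {
		if (!(active & (UINT32_C(1) << i)))
			continue;

		const char *name = cor_names[i] ? cor_names[i]
						: "Unknown Error Bit";
		int n = snprintf(buf + off, len - off, "[%2d] %s\n", i, name);

		if (n < 0 || (size_t)n >= len - off)
			break;	/* output did not fit; stop here */
		off += (size_t)n;
		count++;
	}
	return count;
}
```

Calling `decode_cor_status(0x00001100, 0x00002000, buf, sizeof(buf))` reports bits 8 and 12, matching the first report in the log; the second report (`00001000`) has only bit 12 set. The mask value `00002000` is bit 13, which is why that bit would not be reported even if it were set.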
Please remove the timestamps; they don't contribute to understanding
the problem.
> This gets repeated continuously, thus inundating the buffer.
Did you verify that we actually clear the Correctable Error Status
register?
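The question matters because the status bits are write-1-to-clear: writing a 1 to a bit position clears it and writing 0 leaves it alone, so a handler that never writes the bits back would see the same error reported again and again. The following toy model in user-space C sketches that behaviour and the read, write back, re-read check being asked for. It is not kernel code, and `struct rw1c_reg`, `hw_latch()`, `reg_read()`, `reg_write()` and `clear_and_verify()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical model of a write-1-to-clear (RW1C) status register. */
struct rw1c_reg {
	uint32_t value;
};

/* Hardware side: an error event latches (sets) status bits. */
static void hw_latch(struct rw1c_reg *r, uint32_t bits)
{
	r->value |= bits;
}

/* Software side: a plain read returns the latched bits. */
static uint32_t reg_read(const struct rw1c_reg *r)
{
	return r->value;
}

/* Software side: each 1 written clears that bit; 0 bits are untouched. */
static void reg_write(struct rw1c_reg *r, uint32_t v)
{
	r->value &= ~v;
}

/*
 * Read the status, write the same value back to clear exactly the bits
 * that were seen, then re-read. A zero result means the clear took
 * effect and no new error was latched in between.
 */
static uint32_t clear_and_verify(struct rw1c_reg *r)
{
	uint32_t status = reg_read(r);

	reg_write(r, status);
	return reg_read(r);
}
```

On real hardware a non-zero re-read does not by itself prove the clear failed, since a new error may have been latched in the meantime; comparing against the bits that were written back helps tell the two cases apart.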
https://bugzilla.kernel.org/show_bug.cgi?id=216863 looks like a
similar issue. The issue Frederick is seeing happens when resuming
from sleep. Is there some event that triggers the correctable errors
you see?
Bjorn
* Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
2023-01-03 19:14 ` Bjorn Helgaas
(?)
@ 2023-01-04 4:57 ` Rajat Khandelwal
2023-01-04 6:46 ` Leon Romanovsky
-1 siblings, 1 reply; 8+ messages in thread
From: Rajat Khandelwal @ 2023-01-04 4:57 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Paul Menzel, Neftin, Sasha, Leon Romanovsky, linux-pci,
Frederick Zhang, rajat.khandelwal, linux-kernel, oohall,
bhelgaas, linuxppc-dev
Hi Bjorn,
Thanks for the acknowledgement.
On 1/4/2023 12:44 AM, Bjorn Helgaas wrote:
> [+cc Paul, Sasha, Leon, Frederick]
>
> (Please cc folks who have commented on previous versions of your
> patch.)
>
> On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
>> There are many instances where correctable errors tend to inundate
>> the message buffer. We observe such instances during thunderbolt PCIe
>> tunneling.
>>
>> It's true that they are mitigated by the hardware and are non-fatal
>> but we shouldn't be spamming the logs with such correctable errors as it
>> confuses other kernel developers less familiar with PCI errors, support
>> staff, and users who happen to look at the logs, hence rate limit them.
> I want a better understanding of why we have so many errors before
> rate-limiting everybody.
--> We are debugging this inside Intel along with the Thunderbolt/PCIe team. It will
take some time to reach a conclusion. Since I see these errors in other Thunderbolt devices
as well, I am currently grouping all the TBT devices so that we have proper data to debug.
>
>> A typical example log inside an HP TBT4 dock:
>> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
>> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
>> [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001100/00002000
>> [54912.661211] igc 0000:2b:00.0: [ 8] Rollover
>> [54912.661219] igc 0000:2b:00.0: [12] Timeout
>> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
>> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
>> [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
>> [54982.838817] igc 0000:2b:00.0: [12] Timeout
> Please remove the timestamps; they don't contribute to understanding
> the problem.
--> Sure.
>
>> This gets repeated continuously, thus inundating the buffer.
> Did you verify that we actually clear the Correctable Error Status
> register?
--> This patch only rate limits the reporting of correctable errors, since they are
non-fatal and they inundate the kernel logs, particularly during Thunderbolt
connections. It has no impact anywhere else.
As per your suggestion on the igc patch, I found rate limiting to be a workable option
for now. I have dropped any masking of the bits.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=216863 looks like a
> similar issue. The issue Frederick is seeing happens when resuming
> from sleep. Is there some event that triggers the correctable errors
> you see?
--> The signatures look similar, but there is no single event that triggers these errors.
I see them in many situations (hot plug, cold boot, warm boot, s0ix, etc.).
Further, I think the replay-related correctable errors arise in Thunderbolt PCIe devices
because the timeout values are not adjusted properly for Thunderbolt daisy chains.
I am not sure, but since these PCIe devices work when attached directly to the motherboard
and only give issues when they are inside Thunderbolt devices, I think the timeout values
do not account for the additional PCIe bridges in the daisy chain.
>
> Bjorn
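On collecting data while the root cause is investigated: rate limiting only affects what reaches the log, and the patch leaves the `pci_dev_aer_stats_incr()` call in place, so per-device counters should still advance for every report. Assuming those counters are exposed through an `aer_dev_correctable` sysfs attribute containing a `TOTAL_ERR_COR <n>` line (an assumption about the running kernel, not something stated in this thread), a small user-space helper could extract the total like this. `parse_total_cor()` is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scans text line by line for a "TOTAL_ERR_COR <n>" entry.
 * Returns the count, or -1 if the entry is missing or malformed.
 * The line format is an assumption, see the text above.
 */
static long parse_total_cor(const char *text)
{
	static const char key[] = "TOTAL_ERR_COR";
	const char *p = text;

	while (p && *p) {
		if (strncmp(p, key, sizeof(key) - 1) == 0) {
			long v;

			/* sscanf skips the whitespace before the number. */
			if (sscanf(p + sizeof(key) - 1, "%ld", &v) == 1)
				return v;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}
```

Sampling such a counter before and after an event (hot plug, resume, and so on) would give a rate that does not depend on how many log lines were suppressed.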
* Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
2023-01-04 4:57 ` Rajat Khandelwal
@ 2023-01-04 6:46 ` Leon Romanovsky
0 siblings, 0 replies; 8+ messages in thread
From: Leon Romanovsky @ 2023-01-04 6:46 UTC (permalink / raw)
To: Rajat Khandelwal
Cc: Paul Menzel, Neftin, Sasha, linux-pci, Frederick Zhang,
rajat.khandelwal, linux-kernel, oohall, bhelgaas, Bjorn Helgaas,
linuxppc-dev
On Wed, Jan 04, 2023 at 10:27:33AM +0530, Rajat Khandelwal wrote:
> Hi Bjorn,
>
> Thanks for the acknowledgement.
>
> On 1/4/2023 12:44 AM, Bjorn Helgaas wrote:
> > [+cc Paul, Sasha, Leon, Frederick]
> >
> > (Please cc folks who have commented on previous versions of your
> > patch.)
> >
> > On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
> > > There are many instances where correctable errors tend to inundate
> > > the message buffer. We observe such instances during thunderbolt PCIe
> > > tunneling.
<...>
> > > [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
> > > [54982.838817] igc 0000:2b:00.0: [12] Timeout
> > Please remove the timestamps; they don't contribute to understanding
> > the problem.
>
> --> Sure.
Please don't add "-->" or any other marker to your replies. It breaks the
mail color scheme.
>
> >
> > > This gets repeated continuously, thus inundating the buffer.
> > Did you verify that we actually clear the Correctable Error Status
> > register?
>
> --> This patch targets only rate limiting the correctable errors since they are
> non-fatal, and they kind of inundate the CPU logs, particularly during thunderbolt
> connections. It doesn't have an impact anywhere else.
> As per your suggestion in the igc patch, I found rate limiting as a doable option
> currently. Have eradicated any kind of masking the bits.
You didn't answer the question that was asked: "Did you verify that we actually
clear the Correctable Error Status register?"
Thanks
* Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
2023-01-04 6:46 ` Leon Romanovsky
(?)
@ 2023-01-04 13:04 ` Rajat Khandelwal
-1 siblings, 0 replies; 8+ messages in thread
From: Rajat Khandelwal @ 2023-01-04 13:04 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Paul Menzel, Neftin, Sasha, linux-pci, Frederick Zhang,
rajat.khandelwal, linux-kernel, oohall, bhelgaas, Bjorn Helgaas,
linuxppc-dev
Hi Leon,
Thanks for the ack.
On 1/4/2023 12:16 PM, Leon Romanovsky wrote:
> On Wed, Jan 04, 2023 at 10:27:33AM +0530, Rajat Khandelwal wrote:
>> Hi Bjorn,
>>
>> Thanks for the acknowledgement.
>>
>> On 1/4/2023 12:44 AM, Bjorn Helgaas wrote:
>>> [+cc Paul, Sasha, Leon, Frederick]
>>>
>>> (Please cc folks who have commented on previous versions of your
>>> patch.)
>>>
>>> On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
>>>> There are many instances where correctable errors tend to inundate
>>>> the message buffer. We observe such instances during thunderbolt PCIe
>>>> tunneling.
> <...>
>
>>>> [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask=00001000/00002000
>>>> [54982.838817] igc 0000:2b:00.0: [12] Timeout
>>> Please remove the timestamps; they don't contribute to understanding
>>> the problem.
>> --> Sure.
> Please don't add "-->" or any marker to replies. It breaks mail color
> scheme.
>
>>>> This gets repeated continuously, thus inundating the buffer.
>>> Did you verify that we actually clear the Correctable Error Status
>>> register?
>> --> This patch targets only rate limiting the correctable errors since they are
>> non-fatal, and they kind of inundate the CPU logs, particularly during thunderbolt
>> connections. It doesn't have an impact anywhere else.
>> As per your suggestion in the igc patch, I found rate limiting as a doable option
>> currently. Have eradicated any kind of masking the bits.
> You didn't answer on the asked question. "Did you verify that we actually clear
> the Correctable Error Status register?".
Yes, I have verified. The status is cleared successfully.
>
> Thanks