* [rfc] suppress excessive AER output
@ 2011-08-03 22:34 Dave Jones
  2011-08-04  5:54 ` Zhang, Yanmin
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Dave Jones @ 2011-08-03 22:34 UTC (permalink / raw)
  To: Linux Kernel; +Cc: tom.l.nguyen, yanmin.zhang

I have a machine that has developed some kind of problem with
its onboard ethernet.  It still boots, but spewed almost 1.5G of text
(2381585 instances of the warning below) before we realised what
was going on, and blacklisted the igb driver.

Is it worth logging every single error when we're flooding like this ?
It seems unlikely that we'll find useful information in amongst that much data
that wasn't already in the first 100 instances.

I picked 100 in the (untested) example patch below arbitrarily, but the exact
value could be smaller, or slightly bigger.

could we do something like this maybe ?

	Dave

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 3ea5173..4ec88c6 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -153,6 +153,17 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int id = ((dev->bus->number << 8) | dev->devfn);
 	char prefix[44];
+	static unsigned long aer_printk_limit = 0;
+
+	aer_printk_limit++;
+
+	if (aer_printk_limit > 100)
+		return;
+
+	if (aer_printk_limit == 100) {
+		printk(KERN_ERR "Reached limit of 100 AER errors. Further AER output suppressed.\n");
+		return;
+	}
 
 	snprintf(prefix, sizeof(prefix), "%s%s %s: ",
 		 (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR,


* RE: [rfc] suppress excessive AER output
  2011-08-03 22:34 [rfc] suppress excessive AER output Dave Jones
@ 2011-08-04  5:54 ` Zhang, Yanmin
  2011-08-05  1:50   ` Dave Jones
  2011-08-04  6:45 ` huang ying
  2011-08-05  2:24 ` Arnaud Lacombe
  2 siblings, 1 reply; 9+ messages in thread
From: Zhang, Yanmin @ 2011-08-04  5:54 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel; +Cc: Nguyen, Tom L, Huang, Ying

Dave,

How about adding a new module parameter aer_printk_limit, so user space could reset it any time?

Yanmin


* Re: [rfc] suppress excessive AER output
  2011-08-03 22:34 [rfc] suppress excessive AER output Dave Jones
  2011-08-04  5:54 ` Zhang, Yanmin
@ 2011-08-04  6:45 ` huang ying
  2011-08-05  1:50   ` Dave Jones
  2011-08-05  2:24 ` Arnaud Lacombe
  2 siblings, 1 reply; 9+ messages in thread
From: huang ying @ 2011-08-04  6:45 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, tom.l.nguyen, yanmin.zhang

On Thu, Aug 4, 2011 at 6:34 AM, Dave Jones <davej@redhat.com> wrote:
> I have a machine that has developed some kind of problem with
> its onboard ethernet.  It still boots, but spewed almost 1.5G of text
> (2381585 instances of the warning below) before we realised what
> was going on, and blacklisted the igb driver.
>
> Is it worth logging every single error when we're flooding like this ?
> It seems unlikely that we'll find useful information in amongst that much data
> that wasn't already in the first 100 instances.
>
> I picked 100 in the (untested) example patch below arbitrarily, but the exact
> value could be smaller, or slightly bigger..
>
> could we do something like this maybe ?

Why not use __ratelimit to implement this feature?

Best Regards,
Huang Ying
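
For illustration, here is a minimal userspace sketch of the interval/burst idea behind __ratelimit: at most `burst` messages are let through per `interval`, and the rest are counted as missed. It is a model only, not the kernel's implementation; `struct rl_state`, `rl_init`, `rl_allow` and `rl_count_allowed` are hypothetical names, and time is passed in as a plain tick counter so the behaviour is easy to reason about.

```c
#include <stdbool.h>

/* Hypothetical userspace model of an interval/burst limiter. */
struct rl_state {
	unsigned long interval;  /* window length, in arbitrary ticks */
	unsigned int burst;      /* messages allowed per window */
	unsigned long begin;     /* tick at which the current window began */
	unsigned int printed;    /* messages let through in this window */
	unsigned long missed;    /* messages suppressed in this window */
	bool started;            /* false until the first event is seen */
};

static void rl_init(struct rl_state *rs, unsigned long interval,
		    unsigned int burst)
{
	rs->interval = interval;
	rs->burst = burst;
	rs->begin = 0;
	rs->printed = 0;
	rs->missed = 0;
	rs->started = false;
}

/* Returns true if the caller may print a message at tick 'now'. */
static bool rl_allow(struct rl_state *rs, unsigned long now)
{
	/* Open a fresh window on first use or once the interval has elapsed. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
		rs->missed = 0;
		rs->started = true;
	}

	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}

	rs->missed++;
	return false;
}

/* Feed one event per tick and count how many get through. */
static unsigned long rl_count_allowed(struct rl_state *rs,
				      unsigned long events)
{
	unsigned long allowed = 0;

	for (unsigned long t = 0; t < events; t++)
		if (rl_allow(rs, t))
			allowed++;
	return allowed;
}
```

With an interval of 1000 ticks and a burst of 10, a steady stream of 10000 events lets 100 through. The output is bounded per window, but it never stops while the errors keep arriving.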


* Re: [rfc] suppress excessive AER output
  2011-08-04  5:54 ` Zhang, Yanmin
@ 2011-08-05  1:50   ` Dave Jones
  2011-08-05  1:59     ` Zhang, Yanmin
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Jones @ 2011-08-05  1:50 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Linux Kernel, Nguyen, Tom L, Huang, Ying

On Thu, Aug 04, 2011 at 01:54:19PM +0800, Zhang, Yanmin wrote:
 > Dave,
 > 
 > How about adding a new module parameter aer_printk_limit, so user space could reset it any time?

Sure. Though for the built-in case, how would that work ?

	Dave



* Re: [rfc] suppress excessive AER output
  2011-08-04  6:45 ` huang ying
@ 2011-08-05  1:50   ` Dave Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Jones @ 2011-08-05  1:50 UTC (permalink / raw)
  To: huang ying; +Cc: Linux Kernel, tom.l.nguyen, yanmin.zhang

On Thu, Aug 04, 2011 at 02:45:01PM +0800, huang ying wrote:
 > On Thu, Aug 4, 2011 at 6:34 AM, Dave Jones <davej@redhat.com> wrote:
 > > I have a machine that has developed some kind of problem with
 > > its onboard ethernet.  It still boots, but spewed almost 1.5G of text
 > > (2381585 instances of the warning below) before we realised what
 > > was going on, and blacklisted the igb driver.
 > >
 > > Is it worth logging every single error when we're flooding like this ?
 > > It seems unlikely that we'll find useful information in amongst that much data
 > > that wasn't already in the first 100 instances.
 > >
 > > I picked 100 in the (untested) example patch below arbitrarily, but the exact
 > > value could be smaller, or slightly bigger..
 > >
 > > could we do something like this maybe ?
 > 
 > Why not use __ratelimit to implement this feature?

that would be better than the current situation probably, but my gut feeling is
that there's still going to be a lot of spew.

	Dave



* RE: [rfc] suppress excessive AER output
  2011-08-05  1:50   ` Dave Jones
@ 2011-08-05  1:59     ` Zhang, Yanmin
  2011-08-05  2:09       ` Dave Jones
  0 siblings, 1 reply; 9+ messages in thread
From: Zhang, Yanmin @ 2011-08-05  1:59 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linux Kernel, Nguyen, Tom L, Huang, Ying

Dave,

My idea is that an application has the opportunity to reset it. Consider a mission-critical environment: the admin could hot-unplug any failed device without rebooting the system, and would then not want to lose AER monitoring.

With the module parameter, the admin could change it under /sys/module/aer_drv/parameters.

Yanmin
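
A rough userspace sketch of that idea, assuming the limit lives in a variable that can be rewritten at run time and that writing it also re-arms logging. All names here (`aer_limit`, `aer_seen`, `aer_should_print`, `aer_limit_set`) are hypothetical; in the driver the variable would presumably be exposed with module_param() rather than through a setter function.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this models the idea, not aerdrv code. */
static unsigned long aer_limit = 100;  /* stands in for the tunable parameter */
static unsigned long aer_seen;         /* errors printed since the last reset */

/* Returns true if this error should be printed. */
static bool aer_should_print(void)
{
	/* The counter stops at the limit, so it can never wrap. */
	if (aer_seen >= aer_limit)
		return false;
	aer_seen++;
	return true;
}

/* Models the admin writing a new limit: store it and re-arm logging. */
static void aer_limit_set(unsigned long new_limit)
{
	aer_limit = new_limit;
	aer_seen = 0;
}
```

After the failed device has been unplugged, a single write re-enables reporting for the next fault without a reboot.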


* Re: [rfc] suppress excessive AER output
  2011-08-05  1:59     ` Zhang, Yanmin
@ 2011-08-05  2:09       ` Dave Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Jones @ 2011-08-05  2:09 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Linux Kernel, Nguyen, Tom L, Huang, Ying

On Fri, Aug 05, 2011 at 09:59:39AM +0800, Zhang, Yanmin wrote:
 > Dave,
 > 
 > My idea is that an application has the opportunity to reset it. Consider a mission-critical environment: the admin could hot-unplug any failed device without rebooting the system, and would then not want to lose AER monitoring.
 > 
 > With the module parameter, the admin could change it under /sys/module/aer_drv/parameters.
 
ah, I see. Yes, that sounds sensible.

	Dave.
 


* Re: [rfc] suppress excessive AER output
  2011-08-03 22:34 [rfc] suppress excessive AER output Dave Jones
  2011-08-04  5:54 ` Zhang, Yanmin
  2011-08-04  6:45 ` huang ying
@ 2011-08-05  2:24 ` Arnaud Lacombe
  2011-08-05  2:33   ` Dave Jones
  2 siblings, 1 reply; 9+ messages in thread
From: Arnaud Lacombe @ 2011-08-05  2:24 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, tom.l.nguyen, yanmin.zhang

Hi,

On Wed, Aug 3, 2011 at 6:34 PM, Dave Jones <davej@redhat.com> wrote:
> I have a machine that has developed some kind of problem with
> its onboard ethernet.  It still boots, but spewed almost 1.5G of text
> (2381585 instances of the warning below) before we realised what
> was going on, and blacklisted the igb driver.
>
> Is it worth logging every single error when we're flooding like this ?
> It seems unlikely that we'll find useful information in amongst that much data
> that wasn't already in the first 100 instances.
>
> I picked 100 in the (untested) example patch below arbitrarily, but the exact
> value could be smaller, or slightly bigger..
>
> could we do something like this maybe ?
>
Please do not reinvent the wheel and use printk_ratelimited().

Thanks,
 - Arnaud


* Re: [rfc] suppress excessive AER output
  2011-08-05  2:24 ` Arnaud Lacombe
@ 2011-08-05  2:33   ` Dave Jones
  0 siblings, 0 replies; 9+ messages in thread
From: Dave Jones @ 2011-08-05  2:33 UTC (permalink / raw)
  To: Arnaud Lacombe; +Cc: Linux Kernel, tom.l.nguyen, yanmin.zhang

On Thu, Aug 04, 2011 at 10:24:07PM -0400, Arnaud Lacombe wrote:
 > Hi,
 > 
 > On Wed, Aug 3, 2011 at 6:34 PM, Dave Jones <davej@redhat.com> wrote:
 > > I have a machine that has developed some kind of problem with
 > > its onboard ethernet.  It still boots, but spewed almost 1.5G of text
 > > (2381585 instances of the warning below) before we realised what
 > > was going on, and blacklisted the igb driver.
 > >
 > > Is it worth logging every single error when we're flooding like this ?
 > > It seems unlikely that we'll find useful information in amongst that much data
 > > that wasn't already in the first 100 instances.
 > >
 > > I picked 100 in the (untested) example patch below arbitrarily, but the exact
 > > value could be smaller, or slightly bigger..
 > >
 > > could we do something like this maybe ?
 > >
 > Please do not reinvent the wheel and use printk_ratelimited().

It's a different wheel.

printk_ratelimit slows down the output, but would still cause a lot of messages.

my diff turns it off completely after a threshold (apart from at the ulong wrap,
which I overlooked).

	Dave
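
To make the wrap concrete, here is a small userspace model of the control flow in the posted diff. It deliberately uses an 8-bit counter instead of unsigned long so the wrap shows up after 256 calls. A variant that stops counting once the limit has been passed sits next to it for comparison. The names (`posted_logic`, `saturating_logic`, `count_errors`, `enum action`) are hypothetical, and this is a sketch of the logic only, not a replacement patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define LIMIT 100

enum action { PRINT_ERROR, PRINT_NOTICE, SUPPRESS };

/*
 * Same control flow as the posted diff, but with an 8-bit counter
 * (an assumption made for illustration) so that the wrap is observable.
 */
static enum action posted_logic(uint8_t *count)
{
	(*count)++;                    /* wraps from 255 back to 0 */
	if (*count > LIMIT)
		return SUPPRESS;
	if (*count == LIMIT)
		return PRINT_NOTICE;   /* printed in place of the 100th error */
	return PRINT_ERROR;
}

/* Variant that stops counting once past the limit, so it cannot wrap. */
static enum action saturating_logic(uint8_t *count)
{
	if (*count > LIMIT)
		return SUPPRESS;
	(*count)++;
	if (*count > LIMIT)
		return PRINT_NOTICE;   /* printed once, after LIMIT errors */
	return PRINT_ERROR;
}

/* Run 'calls' events through 'fn' and count the errors actually printed. */
static unsigned int count_errors(enum action (*fn)(uint8_t *),
				 unsigned int calls)
{
	uint8_t count = 0;
	unsigned int errors = 0;

	for (unsigned int i = 0; i < calls; i++)
		if (fn(&count) == PRINT_ERROR)
			errors++;
	return errors;
}
```

In this model the posted logic prints 99 errors before the notice and starts printing again once the counter wraps. The saturating variant prints exactly 100 errors and then stays quiet.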


