linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
@ 2019-01-07 15:12 Waiman Long
  2019-01-07 15:12 ` [PATCH 1/2] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Waiman Long @ 2019-01-07 15:12 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet
  Cc: linux-kernel, linux-doc, linux-fsdevel, Davidlohr Bueso,
	Miklos Szeredi, Daniel Colascione, Dave Chinner, Randy Dunlap,
	Waiman Long

As newer systems have more and more IRQs and CPUs, the performance of
reading /proc/stat frequently is getting worse and worse.

Last year, I proposed a patch to move the IRQ line of /proc/stat into
a new procfs file. However, that may break existing applications that
depend on the presence of the IRQ line in /proc/stat.

Davidlohr proposed a new /proc/stat2 file which is almost the same as
/proc/stat except that it has no IRQ line. That would require
applications to be changed to use the new procfs file.

This patchset uses a different approach to reduce the performance impact
by using a sysctl parameter to control the maximum frequency at which the
IRQ count computation can be done. The result is that the IRQ counts from
/proc/stat may be a bit stale, but still provide the needed information.
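
For reference, the gist of the approach is the pattern below (a
simplified sketch only, not the actual patch: the locking and cached
output buffer of patch 2 are omitted, the names are illustrative, and
compute_stat_irqs_sum() is the helper factored out in patch 1):

/* needs <linux/jiffies.h> for jiffies, time_after(), msecs_to_jiffies() */
static unsigned long last_update;       /* jiffies of last recomputation */
static u64 cached_sum;

static u64 get_irqs_sum(unsigned int latency_ms)
{
        unsigned long max_age = msecs_to_jiffies(latency_ms);

        /* redo the expensive per-CPU summation only when the cache is stale */
        if (!latency_ms || time_after(jiffies, last_update + max_age)) {
                cached_sum = compute_stat_irqs_sum();
                last_update = jiffies;
        }
        return cached_sum;
}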

Waiman Long (2):
  /proc/stat: Extract irqs counting code into show_stat_irqs()
  /proc/stat: Add sysctl parameter to control irq counts latency

 Documentation/sysctl/fs.txt | 16 +++++++++
 fs/proc/stat.c              | 87 +++++++++++++++++++++++++++++++++++++++------
 kernel/sysctl.c             | 12 +++++++
 3 files changed, 105 insertions(+), 10 deletions(-)

-- 
1.8.3.1


* [PATCH 1/2] /proc/stat: Extract irqs counting code into show_stat_irqs()
  2019-01-07 15:12 [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Waiman Long
@ 2019-01-07 15:12 ` Waiman Long
  2019-01-07 21:42   ` Kees Cook
  2019-01-07 15:12 ` [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency Waiman Long
  2019-01-07 22:32 ` [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Dave Chinner
  2 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-01-07 15:12 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet
  Cc: linux-kernel, linux-doc, linux-fsdevel, Davidlohr Bueso,
	Miklos Szeredi, Daniel Colascione, Dave Chinner, Randy Dunlap,
	Waiman Long

The code that generates the "intr" line of /proc/stat is moved from
show_stat() into a new function, show_stat_irqs(). There is no
functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/proc/stat.c | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 535eda7..4b06f1b 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -79,12 +79,38 @@ static u64 get_iowait_time(int cpu)
 
 #endif
 
+static u64 compute_stat_irqs_sum(void)
+{
+	int i;
+	u64 sum = 0;
+
+	for_each_possible_cpu(i) {
+		sum += kstat_cpu_irqs_sum(i);
+		sum += arch_irq_stat_cpu(i);
+	}
+	sum += arch_irq_stat();
+	return sum;
+}
+
+/*
+ * Print out the "intr" line of /proc/stat.
+ */
+static void show_stat_irqs(struct seq_file *p)
+{
+	int i;
+
+	seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
+	for_each_irq_nr(i)
+		seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
+
+	seq_putc(p, '\n');
+}
+
 static int show_stat(struct seq_file *p, void *v)
 {
 	int i, j;
 	u64 user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 guest, guest_nice;
-	u64 sum = 0;
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec64 boottime;
@@ -105,8 +131,6 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -115,7 +139,6 @@ static int show_stat(struct seq_file *p, void *v)
 			sum_softirq += softirq_stat;
 		}
 	}
-	sum += arch_irq_stat();
 
 	seq_put_decimal_ull(p, "cpu  ", nsec_to_clock_t(user));
 	seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
@@ -154,14 +177,10 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
 		seq_putc(p, '\n');
 	}
-	seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
+	show_stat_irqs(p);
 
 	seq_printf(p,
-		"\nctxt %llu\n"
+		"ctxt %llu\n"
 		"btime %llu\n"
 		"processes %lu\n"
 		"procs_running %lu\n"
-- 
1.8.3.1


* [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency
  2019-01-07 15:12 [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Waiman Long
  2019-01-07 15:12 ` [PATCH 1/2] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
@ 2019-01-07 15:12 ` Waiman Long
  2019-01-07 15:58   ` Matthew Wilcox
       [not found]   ` <20190118084456.GA10690@shao2-debian>
  2019-01-07 22:32 ` [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Dave Chinner
  2 siblings, 2 replies; 22+ messages in thread
From: Waiman Long @ 2019-01-07 15:12 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet
  Cc: linux-kernel, linux-doc, linux-fsdevel, Davidlohr Bueso,
	Miklos Szeredi, Daniel Colascione, Dave Chinner, Randy Dunlap,
	Waiman Long

Reading /proc/stat can be slow, especially on systems with many CPUs
and many irqs, as it requires summing all the percpu counts for each
of the irqs. On some newer systems, there can be more than 1000 irqs
per socket.

Applications that need to read /proc/stat many times per second will
easily hit a bottleneck. In reality, the irq counts are seldom looked
at, and even the applications that do read them don't really need
up-to-date information. One way to reduce the performance impact of
the irq count computation is to do it less frequently.

A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
control the maximum latency in milliseconds allowed between the time
when the computation was done and when the values are reported. Setting
this parameter to an appropriate value will allow us to reduce the
performance impact of reading /proc/stat repetitively. If /proc/stat
is read once in a while, the irq counts will be accurate. Reading
/proc/stat repetitively, however, may make the counts somewhat stale.

On a 4-socket 96-core Broadwell system (HT off) with 2824 irqs,
the times for reading /proc/stat 10,000 times with various values of
proc-stat-irqs-latency-ms were:

  proc-stat-irqs-latency-ms   elapsed time   sys time
  -------------------------   ------------   --------
              0                 11.041s       9.452s
              1                 12.983s      10.314s
             10                  8.452s       5.466s
            100                  8.003s       4.882s
           1000                  8.000s       4.740s

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/sysctl/fs.txt | 16 +++++++++++++++
 fs/proc/stat.c              | 48 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sysctl.c             | 12 ++++++++++++
 3 files changed, 76 insertions(+)

diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
index 819caf8..603d1b5 100644
--- a/Documentation/sysctl/fs.txt
+++ b/Documentation/sysctl/fs.txt
@@ -34,6 +34,7 @@ Currently, these files are in /proc/sys/fs:
 - overflowgid
 - pipe-user-pages-hard
 - pipe-user-pages-soft
+- proc-stat-irqs-latency-ms
 - protected_fifos
 - protected_hardlinks
 - protected_regular
@@ -184,6 +185,21 @@ applied.
 
 ==============================================================
 
+proc-stat-irqs-latency-ms:
+
+The maximum latency (in milliseconds) between the time when the IRQ counts
+in the "intr" line of /proc/stat were computed and the time when they
+are reported.
+
+The default is 0 which means the counts are computed every time
+/proc/stat is read. As computing the IRQ counts can be the most time
+consuming part of accessing /proc/stat, setting a high enough value
+will shorten the time to read it in most cases.
+
+The actual maximum latency is rounded up to the next multiple of jiffies.
+
+==============================================================
+
 protected_fifos:
 
 The intent of this protection is to avoid unintentional writes to
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 4b06f1b..52f5845 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -13,6 +13,7 @@
 #include <linux/irqnr.h>
 #include <linux/sched/cputime.h>
 #include <linux/tick.h>
+#include <linux/jiffies.h>
 
 #ifndef arch_irq_stat_cpu
 #define arch_irq_stat_cpu(cpu) 0
@@ -21,6 +22,12 @@
 #define arch_irq_stat() 0
 #endif
 
+/*
+ * Maximum latency (in ms) of the irq values reported in the "intr" line.
+ * This is converted internally to multiple of jiffies.
+ */
+unsigned int proc_stat_irqs_latency_ms;
+
 #ifdef arch_idle_time
 
 static u64 get_idle_time(int cpu)
@@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
 static void show_stat_irqs(struct seq_file *p)
 {
 	int i;
+#ifdef CONFIG_PROC_SYSCTL
+	static char *irqs_buf;		   /* Buffer for irqs values */
+	static int buflen;
+	static unsigned long last_jiffies; /* Last buffer update jiffies */
+	static DEFINE_MUTEX(irqs_mutex);
+	unsigned int latency = proc_stat_irqs_latency_ms;
+
+	if (latency) {
+		char *ptr;
+
+		latency = _msecs_to_jiffies(latency);
+
+		mutex_lock(&irqs_mutex);
+		if (irqs_buf && time_before(jiffies, last_jiffies + latency))
+			goto print_out;
+
+		/*
+		 * Each irq value may require up to 11 bytes.
+		 */
+		if (!irqs_buf) {
+			irqs_buf = kmalloc(nr_irqs * 11 + 32,
+					   GFP_KERNEL | __GFP_ZERO);
+			if (!irqs_buf) {
+				mutex_unlock(&irqs_mutex);
+				goto fallback;
+			}
+		}
 
+		ptr = irqs_buf;
+		ptr += sprintf(ptr, "intr %llu", compute_stat_irqs_sum());
+		for_each_irq_nr(i)
+			ptr += sprintf(ptr, " %u", kstat_irqs_usr(i));
+		*ptr++ = '\n';
+		buflen = ptr - irqs_buf;
+		last_jiffies = jiffies;
+print_out:
+		seq_write(p, irqs_buf, buflen);
+		mutex_unlock(&irqs_mutex);
+		return;
+	}
+fallback:
+#endif
 	seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
 	for_each_irq_nr(i)
 		seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1825f71..07010c9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -114,6 +114,9 @@
 #ifndef CONFIG_MMU
 extern int sysctl_nr_trim_pages;
 #endif
+#ifdef CONFIG_PROC_FS
+extern unsigned int proc_stat_irqs_latency_ms;
+#endif
 
 /* Constants used for minimum and  maximum */
 #ifdef CONFIG_LOCKUP_DETECTOR
@@ -1890,6 +1893,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &one,
 	},
+#ifdef CONFIG_PROC_FS
+	{
+		.procname	= "proc-stat-irqs-latency-ms",
+		.data		= &proc_stat_irqs_latency_ms,
+		.maxlen		= sizeof(proc_stat_irqs_latency_ms),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec,
+	},
+#endif
 	{ }
 };
 
-- 
1.8.3.1


* Re: [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency
  2019-01-07 15:12 ` [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency Waiman Long
@ 2019-01-07 15:58   ` Matthew Wilcox
  2019-01-07 16:07     ` Waiman Long
  2019-01-07 16:33     ` Alexey Dobriyan
       [not found]   ` <20190118084456.GA10690@shao2-debian>
  1 sibling, 2 replies; 22+ messages in thread
From: Matthew Wilcox @ 2019-01-07 15:58 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap

On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
> Reading /proc/stat can be slow especially if there are many irqs and on
> systems with many CPUs as summation of all the percpu counts for each
> of the irqs is required. On some newer systems, there can be more than
> 1000 irqs per socket.
> 
> Applications that need to read /proc/stat many times per seconds will
> easily hit a bottleneck. In reality, the irq counts are seldom looked
> at. Even those applications that read them don't really need up-to-date
> information. One way to reduce the performance impact of irq counts
> computation is to do it less frequently.
> 
> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to

No.  No, no, no, no, no.  No.

Stop adding new sysctls for this kind of thing.  It's just a way to shift
blame from us (who allegedly know what we're doing) to the sysadmin
(who probably has better things to be doing than keeping up with the
intricacies of development of every single piece of software running on
their system).

Let's figure out what this _should_ be.  As a strawman, I propose we
update these stats once a second.  That's easy to document and understand.

> @@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
>  static void show_stat_irqs(struct seq_file *p)
>  {
>  	int i;
> +#ifdef CONFIG_PROC_SYSCTL
> +	static char *irqs_buf;		   /* Buffer for irqs values */
> +	static int buflen;
> +	static unsigned long last_jiffies; /* Last buffer update jiffies */
> +	static DEFINE_MUTEX(irqs_mutex);
> +	unsigned int latency = proc_stat_irqs_latency_ms;
> +
> +	if (latency) {
> +		char *ptr;
> +
> +		latency = _msecs_to_jiffies(latency);
> +
> +		mutex_lock(&irqs_mutex);
> +		if (irqs_buf && time_before(jiffies, last_jiffies + latency))
> +			goto print_out;
> +
> +		/*
> +		 * Each irq value may require up to 11 bytes.
> +		 */
> +		if (!irqs_buf) {
> +			irqs_buf = kmalloc(nr_irqs * 11 + 32,
> +					   GFP_KERNEL | __GFP_ZERO);

Why are you caching the _output_ of calling sprintf(), rather than caching the
values of each interrupt?


* Re: [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency
  2019-01-07 15:58   ` Matthew Wilcox
@ 2019-01-07 16:07     ` Waiman Long
  2019-01-07 16:14       ` Matthew Wilcox
  2019-01-07 16:33     ` Alexey Dobriyan
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-01-07 16:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap

On 01/07/2019 10:58 AM, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
>> Reading /proc/stat can be slow especially if there are many irqs and on
>> systems with many CPUs as summation of all the percpu counts for each
>> of the irqs is required. On some newer systems, there can be more than
>> 1000 irqs per socket.
>>
>> Applications that need to read /proc/stat many times per seconds will
>> easily hit a bottleneck. In reality, the irq counts are seldom looked
>> at. Even those applications that read them don't really need up-to-date
>> information. One way to reduce the performance impact of irq counts
>> computation is to do it less frequently.
>>
>> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
> No.  No, no, no, no, no.  No.
>
> Stop adding new sysctls for this kind of thing.  It's just a way to shift
> blame from us (who allegedly know what we're doing) to the sysadmin
> (who probably has better things to be doing than keeping up with the
> intricacies of development of every single piece of software running on
> their system).
>
> Let's figure out what this _should_ be.  As a strawman, I propose we
> update these stats once a second.  That's easy to document and understand.

I am fine with having a fixed value (like 1s) for reporting purposes.
It is just that people may have many different opinions on what the
right value should be. That is why I opted for flexibility in the
initial patch.

>
>> @@ -98,7 +105,48 @@ static u64 compute_stat_irqs_sum(void)
>>  static void show_stat_irqs(struct seq_file *p)
>>  {
>>  	int i;
>> +#ifdef CONFIG_PROC_SYSCTL
>> +	static char *irqs_buf;		   /* Buffer for irqs values */
>> +	static int buflen;
>> +	static unsigned long last_jiffies; /* Last buffer update jiffies */
>> +	static DEFINE_MUTEX(irqs_mutex);
>> +	unsigned int latency = proc_stat_irqs_latency_ms;
>> +
>> +	if (latency) {
>> +		char *ptr;
>> +
>> +		latency = _msecs_to_jiffies(latency);
>> +
>> +		mutex_lock(&irqs_mutex);
>> +		if (irqs_buf && time_before(jiffies, last_jiffies + latency))
>> +			goto print_out;
>> +
>> +		/*
>> +		 * Each irq value may require up to 11 bytes.
>> +		 */
>> +		if (!irqs_buf) {
>> +			irqs_buf = kmalloc(nr_irqs * 11 + 32,
>> +					   GFP_KERNEL | __GFP_ZERO);
> Why are you caching the _output_ of calling sprintf(), rather than caching the
> values of each interrupt?
>
It is just faster to dump the whole string buffer than to redo the
number formatting each time when the values haven't changed. I can
cache the individual sums instead if that is preferred by most.

Cheers,
Longman
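
For reference, the value-caching alternative being discussed would
look roughly like the sketch below (illustrative names only; the
allocation of the array and the mutex from the posted patch are
omitted):

/* cache the per-IRQ sums as numbers and format them on every read */
static u64 irq_total;                   /* cached grand total */
static unsigned int *irq_sums;          /* nr_irqs entries, 4 bytes each */
static unsigned long last_update;

static void show_cached_irqs(struct seq_file *p, unsigned long max_age)
{
        int i;

        if (time_after(jiffies, last_update + max_age)) {
                irq_total = compute_stat_irqs_sum();
                for_each_irq_nr(i)
                        irq_sums[i] = kstat_irqs_usr(i);
                last_update = jiffies;
        }

        seq_put_decimal_ull(p, "intr ", irq_total);
        for_each_irq_nr(i)
                seq_put_decimal_ull(p, " ", irq_sums[i]);
        seq_putc(p, '\n');
}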


* Re: [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency
  2019-01-07 16:07     ` Waiman Long
@ 2019-01-07 16:14       ` Matthew Wilcox
  2019-01-07 16:19         ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2019-01-07 16:14 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap

On Mon, Jan 07, 2019 at 11:07:47AM -0500, Waiman Long wrote:
> > Why are you caching the _output_ of calling sprintf(), rather than caching the
> > values of each interrupt?
> >
> It is just faster to dump the whole string buffer than redoing the
> number formatting each time when the values don't change. I can cache
> the individual sums instead if it is the preferred by most.

But it also consumes more memory.  Can you gather some measurements to
find out what the performance difference is if you cache the values
instead of the strings?


* Re: [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency
  2019-01-07 16:14       ` Matthew Wilcox
@ 2019-01-07 16:19         ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2019-01-07 16:19 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap

On 01/07/2019 11:14 AM, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 11:07:47AM -0500, Waiman Long wrote:
>>> Why are you caching the _output_ of calling sprintf(), rather than caching the
>>> values of each interrupt?
>>>
>> It is just faster to dump the whole string buffer than redoing the
>> number formatting each time when the values don't change. I can cache
>> the individual sums instead if it is the preferred by most.
> But it also consumes more memory.  Can you gather some measurements to
> find out what the performance difference is if you cache the values
> instead of the strings?

I allocate 11 bytes (10 digits + space) for each unsigned int IRQ
value. It needs only 4 bytes for the unsigned value itself, so caching
the values saves 7 bytes per IRQ. With 4k irqs, that is about 28KB.

I will run some measurements of caching the values versus saving the
string later this week.

Thanks,
Longman


* Re: [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency
  2019-01-07 15:58   ` Matthew Wilcox
  2019-01-07 16:07     ` Waiman Long
@ 2019-01-07 16:33     ` Alexey Dobriyan
  2019-01-07 16:59       ` Waiman Long
  1 sibling, 1 reply; 22+ messages in thread
From: Alexey Dobriyan @ 2019-01-07 16:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Waiman Long, Andrew Morton, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap

On Mon, Jan 07, 2019 at 07:58:40AM -0800, Matthew Wilcox wrote:
> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:

> > A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
> 
> No.  No, no, no, no, no.  No.
> 
> Stop adding new sysctls for this kind of thing.  It's just a way to shift
> blame from us (who allegedly know what we're doing) to the sysadmin
> (who probably has better things to be doing than keeping up with the
> intricacies of development of every single piece of software running on
> their system).
> 
> Let's figure out what this _should_ be.

Yeah, let's start interrogating about which values specifically this
super sekret application wants.

I assume CPU stats, so a system call which returns CPU statistics in
binary form.

> Why are you caching the _output_ of calling sprintf(), rather than caching the
> values of each interrupt?

For output caching, a string is better, but I'm not defending the patch.


* Re: [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency
  2019-01-07 16:33     ` Alexey Dobriyan
@ 2019-01-07 16:59       ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2019-01-07 16:59 UTC (permalink / raw)
  To: Alexey Dobriyan, Matthew Wilcox
  Cc: Andrew Morton, Luis Chamberlain, Kees Cook, Jonathan Corbet,
	linux-kernel, linux-doc, linux-fsdevel, Davidlohr Bueso,
	Miklos Szeredi, Daniel Colascione, Dave Chinner, Randy Dunlap

On 01/07/2019 11:33 AM, Alexey Dobriyan wrote:
> On Mon, Jan 07, 2019 at 07:58:40AM -0800, Matthew Wilcox wrote:
>> On Mon, Jan 07, 2019 at 10:12:58AM -0500, Waiman Long wrote:
>>> A new "fs/proc-stat-irqs-latency-ms" sysctl parameter is now added to
>> No.  No, no, no, no, no.  No.
>>
>> Stop adding new sysctls for this kind of thing.  It's just a way to shift
>> blame from us (who allegedly know what we're doing) to the sysadmin
>> (who probably has better things to be doing than keeping up with the
>> intricacies of development of every single piece of software running on
>> their system).
>>
>> Let's figure out what this _should_ be.
> Yeah, let's start interrogating about which values specifically this
> super sekret applications wants.
>
> I assume CPU stats, so system call which returns CPU statistics in binary form.

The /proc/stat file contains a bunch of statistics for different
system parameters. I doubt the irq values are what most applications
that read /proc/stat are looking for. The problem some customers saw
was that their application performance dropped quite significantly
when moving to a newer system with more CPUs and more irqs. The issue
here is that their applications read /proc/stat at a relatively high
frequency. The irq counting overhead can weigh quite significantly and
slow their applications down. Of course, they can modify their
applications to avoid reading /proc/stat that often.

One instance I was notified of recently was the Oracle database, which
used lseek() to move the current file pointer of /proc/stat, causing
the system to recompute the stats every time the file pointer was
moved. Of course, that is not the right way to read a procfs file and
they are probably going to change that. My point is that customers are
going to hit problems like that when they move to larger and newer
systems.

Thanks,
Longman
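
For illustration, the access pattern being described is roughly the
following (a hypothetical userspace example, not the actual
application code):

/*
 * Keep /proc/stat open and rewind it with lseek() before every read,
 * polling at a high frequency; every read regenerates (and re-sums)
 * the whole file.  Error handling omitted for brevity.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        char buf[65536];
        int fd = open("/proc/stat", O_RDONLY);

        for (;;) {
                lseek(fd, 0, SEEK_SET);         /* rewind instead of reopening */
                read(fd, buf, sizeof(buf));
                /* ... parse the fields of interest ... */
                usleep(10000);                  /* ~100 reads per second */
        }
}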


* Re: [PATCH 1/2] /proc/stat: Extract irqs counting code into show_stat_irqs()
  2019-01-07 15:12 ` [PATCH 1/2] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
@ 2019-01-07 21:42   ` Kees Cook
  0 siblings, 0 replies; 22+ messages in thread
From: Kees Cook @ 2019-01-07 21:42 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain,
	Jonathan Corbet, LKML, open list:DOCUMENTATION, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap

On Mon, Jan 7, 2019 at 7:13 AM Waiman Long <longman@redhat.com> wrote:
>
> The code that generates the "intr" line of /proc/stat is now moved
> from show_stat() into a new function - show_stat_irqs(). There is no
> functional change.
>
> Signed-off-by: Waiman Long <longman@redhat.com>

Reviewed-by: Kees Cook <keescook@chromium.org>

-Kees

> ---
>  fs/proc/stat.c | 39 +++++++++++++++++++++++++++++----------
>  1 file changed, 29 insertions(+), 10 deletions(-)
>
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 535eda7..4b06f1b 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -79,12 +79,38 @@ static u64 get_iowait_time(int cpu)
>
>  #endif
>
> +static u64 compute_stat_irqs_sum(void)
> +{
> +       int i;
> +       u64 sum = 0;
> +
> +       for_each_possible_cpu(i) {
> +               sum += kstat_cpu_irqs_sum(i);
> +               sum += arch_irq_stat_cpu(i);
> +       }
> +       sum += arch_irq_stat();
> +       return sum;
> +}
> +
> +/*
> + * Print out the "intr" line of /proc/stat.
> + */
> +static void show_stat_irqs(struct seq_file *p)
> +{
> +       int i;
> +
> +       seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
> +       for_each_irq_nr(i)
> +               seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
> +
> +       seq_putc(p, '\n');
> +}
> +
>  static int show_stat(struct seq_file *p, void *v)
>  {
>         int i, j;
>         u64 user, nice, system, idle, iowait, irq, softirq, steal;
>         u64 guest, guest_nice;
> -       u64 sum = 0;
>         u64 sum_softirq = 0;
>         unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
>         struct timespec64 boottime;
> @@ -105,8 +131,6 @@ static int show_stat(struct seq_file *p, void *v)
>                 steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
>                 guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
>                 guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
> -               sum += kstat_cpu_irqs_sum(i);
> -               sum += arch_irq_stat_cpu(i);
>
>                 for (j = 0; j < NR_SOFTIRQS; j++) {
>                         unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
> @@ -115,7 +139,6 @@ static int show_stat(struct seq_file *p, void *v)
>                         sum_softirq += softirq_stat;
>                 }
>         }
> -       sum += arch_irq_stat();
>
>         seq_put_decimal_ull(p, "cpu  ", nsec_to_clock_t(user));
>         seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
> @@ -154,14 +177,10 @@ static int show_stat(struct seq_file *p, void *v)
>                 seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
>                 seq_putc(p, '\n');
>         }
> -       seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
> -
> -       /* sum again ? it could be updated? */
> -       for_each_irq_nr(j)
> -               seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
> +       show_stat_irqs(p);
>
>         seq_printf(p,
> -               "\nctxt %llu\n"
> +               "ctxt %llu\n"
>                 "btime %llu\n"
>                 "processes %lu\n"
>                 "procs_running %lu\n"
> --
> 1.8.3.1
>


-- 
Kees Cook


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-07 15:12 [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Waiman Long
  2019-01-07 15:12 ` [PATCH 1/2] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
  2019-01-07 15:12 ` [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency Waiman Long
@ 2019-01-07 22:32 ` Dave Chinner
  2019-01-07 22:41   ` Daniel Colascione
  2019-01-07 22:41   ` Waiman Long
  2 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2019-01-07 22:32 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Randy Dunlap

On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> As newer systems have more and more IRQs and CPUs available in their
> system, the performance of reading /proc/stat frequently is getting
> worse and worse.

Because the "roll-your-own" per-cpu counter implementaiton has been
optimised for low possible addition overhead on the premise that
summing the counters is rare and isn't a performance issue. This
patchset is a direct indication that this "summing is rare and can
be slow" premise is now invalid.

We have percpu counter infrastructure that trades off a small amount
of addition overhead for zero-cost reading of the counter value.
i.e. why not just convert this whole mess to percpu_counters and
then just use percpu_counter_read_positive()? Then we just don't
care how often userspace reads the /proc file because there is no
summing involved at all...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
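
For reference, the generic infrastructure referred to here is
<linux/percpu_counter.h>. Applied to interrupt accounting it would
look roughly like the sketch below; the counter and its init site are
illustrative, not an actual conversion of the kstat code:

#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/percpu_counter.h>

static struct percpu_counter irq_events;        /* illustrative counter */

static int __init irq_events_init(void)
{
        return percpu_counter_init(&irq_events, 0, GFP_KERNEL);
}

/* hot path: a cheap per-CPU add; deltas fold into the central count */
static inline void account_irq_event(void)
{
        percpu_counter_add(&irq_events, 1);
}

/* read side: no summing loop, just the (slightly stale) central total */
static u64 irq_events_total(void)
{
        return percpu_counter_read_positive(&irq_events);
}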


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-07 22:32 ` [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Dave Chinner
@ 2019-01-07 22:41   ` Daniel Colascione
  2019-01-07 23:49     ` Alexey Dobriyan
  2019-01-07 22:41   ` Waiman Long
  1 sibling, 1 reply; 22+ messages in thread
From: Daniel Colascione @ 2019-01-07 22:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Andrew Morton, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, Jonathan Corbet, linux-kernel,
	open list:DOCUMENTATION, Linux FS Devel, Davidlohr Bueso,
	Miklos Szeredi, Randy Dunlap

On Mon, Jan 7, 2019 at 5:32 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > As newer systems have more and more IRQs and CPUs available in their
> > system, the performance of reading /proc/stat frequently is getting
> > worse and worse.
>
> Because the "roll-your-own" per-cpu counter implementaiton has been
> optimised for low possible addition overhead on the premise that
> summing the counters is rare and isn't a performance issue. This
> patchset is a direct indication that this "summing is rare and can
> be slow" premise is now invalid.

Focusing on counter performance is, IMHO, missing the mark. Even if
interrupt count collection were made fast, there's *something* in any
particular /proc file that a particular reader doesn't need and that,
by being uselessly collected, needlessly slows that reader. There
should be a general-purpose way for /proc file readers to tell the
kernel which bits of information interest them on a particular read
syscall sequence or particular open(2) or something. Creating a new
proc file for every useful combination of attributes doesn't scale
either.


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-07 22:32 ` [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Dave Chinner
  2019-01-07 22:41   ` Daniel Colascione
@ 2019-01-07 22:41   ` Waiman Long
  2019-01-08  2:04     ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-01-07 22:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Randy Dunlap

On 01/07/2019 05:32 PM, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>> As newer systems have more and more IRQs and CPUs available in their
>> system, the performance of reading /proc/stat frequently is getting
>> worse and worse.
> Because the "roll-your-own" per-cpu counter implementaiton has been
> optimised for low possible addition overhead on the premise that
> summing the counters is rare and isn't a performance issue. This
> patchset is a direct indication that this "summing is rare and can
> be slow" premise is now invalid.
>
> We have percpu counter infrastructure that trades off a small amount
> of addition overhead for zero-cost reading of the counter value.
> i.e. why not just convert this whole mess to percpu_counters and
> then just use percpu_counter_read_positive()? Then we just don't
> care how often userspace reads the /proc file because there is no
> summing involved at all...
>
> Cheers,
>
> Dave.

Yes, percpu_counter_read_positive() is cheap. However, you still need to
pay the price somewhere. In the case of percpu_counter, the update is
more expensive.

I would say the percentage of applications that will hit this problem is
small. But for them, this problem has some significant performance overhead.

Cheers,
Longman


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-07 22:41   ` Daniel Colascione
@ 2019-01-07 23:49     ` Alexey Dobriyan
  0 siblings, 0 replies; 22+ messages in thread
From: Alexey Dobriyan @ 2019-01-07 23:49 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Dave Chinner, Waiman Long, Andrew Morton, Luis Chamberlain,
	Kees Cook, Jonathan Corbet, linux-kernel,
	open list:DOCUMENTATION, Linux FS Devel, Davidlohr Bueso,
	Miklos Szeredi, Randy Dunlap

On Mon, Jan 07, 2019 at 05:41:28PM -0500, Daniel Colascione wrote:
> There should be a general-purpose way for /proc file readers to tell
> the kernel which bits of information interest them on a particular read
> syscall sequence or particular open(2) or something.

This opens a Pandora's box full of ioctls. /proc supports ioctls on
files which aren't /proc/$PID and /proc/sys/*, though.

So people made a mess with allegedly easy text files and now they
continue to make even more of a mess by adding ioctls on top.


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-07 22:41   ` Waiman Long
@ 2019-01-08  2:04     ` Dave Chinner
  2019-01-08 16:11       ` Michal Hocko
  2019-01-08 16:58       ` Waiman Long
  0 siblings, 2 replies; 22+ messages in thread
From: Dave Chinner @ 2019-01-08  2:04 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Randy Dunlap

On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> On 01/07/2019 05:32 PM, Dave Chinner wrote:
> > On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> >> As newer systems have more and more IRQs and CPUs available in their
> >> system, the performance of reading /proc/stat frequently is getting
> >> worse and worse.
> > Because the "roll-your-own" per-cpu counter implementaiton has been
> > optimised for low possible addition overhead on the premise that
> > summing the counters is rare and isn't a performance issue. This
> > patchset is a direct indication that this "summing is rare and can
> > be slow" premise is now invalid.
> >
> > We have percpu counter infrastructure that trades off a small amount
> > of addition overhead for zero-cost reading of the counter value.
> > i.e. why not just convert this whole mess to percpu_counters and
> > then just use percpu_counter_read_positive()? Then we just don't
> > care how often userspace reads the /proc file because there is no
> > summing involved at all...
> >
> > Cheers,
> >
> > Dave.
> 
> Yes, percpu_counter_read_positive() is cheap. However, you still need to
> pay the price somewhere. In the case of percpu_counter, the update is
> more expensive.

Ummm, that's exactly what I just said. It's a percpu counter that
solves the "sum is expensive and frequent" problem, just like you
are encountering here. I do not need basic scalability algorithms
explained to me.

> I would say the percentage of applications that will hit this problem is
> small. But for them, this problem has some significant performance overhead.

Well, duh!

What I was suggesting is that you change the per-cpu counter
implementation to the /generic infrastructure/ that solves this
problem, and then determine if the extra update overhead is at all
measurable. If you can't measure any difference in update overhead,
then slapping complexity on the existing counter to attempt to
mitigate the summing overhead is the wrong solution.

Indeed, it may be that you need to use a custom batch scaling curve
for the generic per-cpu counter infrastructure to mitigate the
update overhead, but the fact is we already have generic
infrastructure that solves your problem and so the solution should
be "use the generic infrastructure" until it can be proven not to
work.

i.e. prove the generic infrastructure is not fit for purpose and
cannot be improved sufficiently to work for this use case before
implementing a complex, one-off snowflake counter implementation...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
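
One way to express the "custom batch scaling curve" mentioned above is
percpu_counter_add_batch(); a sketch of such tuning, reusing the
irq_events counter from the earlier sketch (the batch value is purely
illustrative):

/*
 * A larger batch keeps more of the count in the per-CPU deltas, so
 * updates stay cheap, at the cost of a larger (but still bounded)
 * read-side deviation of up to batch * nr_cpus.
 */
#define IRQ_EVENTS_BATCH        1024

static inline void account_irq_event(void)
{
        percpu_counter_add_batch(&irq_events, 1, IRQ_EVENTS_BATCH);
}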


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-08  2:04     ` Dave Chinner
@ 2019-01-08 16:11       ` Michal Hocko
  2019-01-08 17:05         ` Waiman Long
  2019-01-08 16:58       ` Waiman Long
  1 sibling, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2019-01-08 16:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Waiman Long, Andrew Morton, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, Jonathan Corbet, linux-kernel, linux-doc,
	linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Randy Dunlap

On Tue 08-01-19 13:04:22, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> > On 01/07/2019 05:32 PM, Dave Chinner wrote:
> > > On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > >> As newer systems have more and more IRQs and CPUs available in their
> > >> system, the performance of reading /proc/stat frequently is getting
> > >> worse and worse.
> > > Because the "roll-your-own" per-cpu counter implementaiton has been
> > > optimised for low possible addition overhead on the premise that
> > > summing the counters is rare and isn't a performance issue. This
> > > patchset is a direct indication that this "summing is rare and can
> > > be slow" premise is now invalid.
> > >
> > > We have percpu counter infrastructure that trades off a small amount
> > > of addition overhead for zero-cost reading of the counter value.
> > > i.e. why not just convert this whole mess to percpu_counters and
> > > then just use percpu_counter_read_positive()? Then we just don't
> > > care how often userspace reads the /proc file because there is no
> > > summing involved at all...
> > >
> > > Cheers,
> > >
> > > Dave.
> > 
> > Yes, percpu_counter_read_positive() is cheap. However, you still need to
> > pay the price somewhere. In the case of percpu_counter, the update is
> > more expensive.
> 
> Ummm, that's exactly what I just said. It's a percpu counter that
> solves the "sum is expensive and frequent" problem, just like you
> are encountering here. I do not need basic scalability algorithms
> explained to me.
> 
> > I would say the percentage of applications that will hit this problem is
> > small. But for them, this problem has some significant performance overhead.
> 
> Well, duh!
> 
> What I was suggesting is that you change the per-cpu counter
> implementation to the /generic infrastructure/ that solves this
> problem, and then determine if the extra update overhead is at all
> measurable. If you can't measure any difference in update overhead,
> then slapping complexity on the existing counter to attempt to
> mitigate the summing overhead is the wrong solution.
> 
> Indeed, it may be that you need o use a custom batch scaling curve
> for the generic per-cpu coutner infrastructure to mitigate the
> update overhead, but the fact is we already have generic
> infrastructure that solves your problem and so the solution should
> be "use the generic infrastructure" until it can be proven not to
> work.
> 
> i.e. prove the generic infrastructure is not fit for purpose and
> cannot be improved sufficiently to work for this use case before
> implementing a complex, one-off snowflake counter implementation...

Completely agreed! Apart from that, I find the conversion to a generic
infrastructure worthwhile even if it doesn't solve the problem at hand
completely. If for no other reason than the sheer code removal, as
kstat is not really used for anything apart from this accounting
AFAIR. The less ad-hoc code we have the better IMHO.

And to the underlying problem. Some proc files do not scale on large
machines. Maybe it is time to explain to application writers that if
they are collecting data too aggressively then it won't scale. We can
only do this much. Lying about numbers by hiding updates is, well,
lying and won't solve the underlying problem.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-08  2:04     ` Dave Chinner
  2019-01-08 16:11       ` Michal Hocko
@ 2019-01-08 16:58       ` Waiman Long
  2019-01-08 22:27         ` Dave Chinner
  1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-01-08 16:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Randy Dunlap

On 01/07/2019 09:04 PM, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
>> On 01/07/2019 05:32 PM, Dave Chinner wrote:
>>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>>>> As newer systems have more and more IRQs and CPUs available in their
>>>> system, the performance of reading /proc/stat frequently is getting
>>>> worse and worse.
>>> Because the "roll-your-own" per-cpu counter implementaiton has been
>>> optimised for low possible addition overhead on the premise that
>>> summing the counters is rare and isn't a performance issue. This
>>> patchset is a direct indication that this "summing is rare and can
>>> be slow" premise is now invalid.
>>>
>>> We have percpu counter infrastructure that trades off a small amount
>>> of addition overhead for zero-cost reading of the counter value.
>>> i.e. why not just convert this whole mess to percpu_counters and
>>> then just use percpu_counter_read_positive()? Then we just don't
>>> care how often userspace reads the /proc file because there is no
>>> summing involved at all...
>>>
>>> Cheers,
>>>
>>> Dave.
>> Yes, percpu_counter_read_positive() is cheap. However, you still need to
>> pay the price somewhere. In the case of percpu_counter, the update is
>> more expensive.
> Ummm, that's exactly what I just said. It's a percpu counter that
> solves the "sum is expensive and frequent" problem, just like you
> are encountering here. I do not need basic scalability algorithms
> explained to me.

What I am trying to say is that "sum is expensive and frequent" is
only true for a very small percentage of applications. It is not true
for most of them. I am hesitant to add latency to the interrupt path
that will affect all applications.
 
>> I would say the percentage of applications that will hit this problem is
>> small. But for them, this problem has some significant performance overhead.
> Well, duh!
>
> What I was suggesting is that you change the per-cpu counter
> implementation to the /generic infrastructure/ that solves this
> problem, and then determine if the extra update overhead is at all
> measurable. If you can't measure any difference in update overhead,
> then slapping complexity on the existing counter to attempt to
> mitigate the summing overhead is the wrong solution.
>
> Indeed, it may be that you need o use a custom batch scaling curve
> for the generic per-cpu coutner infrastructure to mitigate the
> update overhead, but the fact is we already have generic
> infrastructure that solves your problem and so the solution should
> be "use the generic infrastructure" until it can be proven not to
> work.
>
> i.e. prove the generic infrastructure is not fit for purpose and
> cannot be improved sufficiently to work for this use case before
> implementing a complex, one-off snowflake counter implementation...

I see your point. I like the deferred summation approach that I am
currently using. If I have to modify the current per-cpu counter
implementation to support that, and probably add counter grouping
support to amortize the overhead, that can be a major undertaking.
This is not a high-priority item for me at the moment, so I may have
to wait until I have some spare time.

Thanks,
Longman


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-08 16:11       ` Michal Hocko
@ 2019-01-08 17:05         ` Waiman Long
  2019-01-08 17:32           ` Waiman Long
  0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2019-01-08 17:05 UTC (permalink / raw)
  To: Michal Hocko, Dave Chinner
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Randy Dunlap

On 01/08/2019 11:11 AM, Michal Hocko wrote:
> On Tue 08-01-19 13:04:22, Dave Chinner wrote:
>> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
>>> On 01/07/2019 05:32 PM, Dave Chinner wrote:
>>>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>>>>> As newer systems have more and more IRQs and CPUs available in their
>>>>> system, the performance of reading /proc/stat frequently is getting
>>>>> worse and worse.
>>>> Because the "roll-your-own" per-cpu counter implementaiton has been
>>>> optimised for low possible addition overhead on the premise that
>>>> summing the counters is rare and isn't a performance issue. This
>>>> patchset is a direct indication that this "summing is rare and can
>>>> be slow" premise is now invalid.
>>>>
>>>> We have percpu counter infrastructure that trades off a small amount
>>>> of addition overhead for zero-cost reading of the counter value.
>>>> i.e. why not just convert this whole mess to percpu_counters and
>>>> then just use percpu_counter_read_positive()? Then we just don't
>>>> care how often userspace reads the /proc file because there is no
>>>> summing involved at all...
>>>>
>>>> Cheers,
>>>>
>>>> Dave.
>>> Yes, percpu_counter_read_positive() is cheap. However, you still need to
>>> pay the price somewhere. In the case of percpu_counter, the update is
>>> more expensive.
>> Ummm, that's exactly what I just said. It's a percpu counter that
>> solves the "sum is expensive and frequent" problem, just like you
>> are encountering here. I do not need basic scalability algorithms
>> explained to me.
>>
>>> I would say the percentage of applications that will hit this problem is
>>> small. But for them, this problem has some significant performance overhead.
>> Well, duh!
>>
>> What I was suggesting is that you change the per-cpu counter
>> implementation to the /generic infrastructure/ that solves this
>> problem, and then determine if the extra update overhead is at all
>> measurable. If you can't measure any difference in update overhead,
>> then slapping complexity on the existing counter to attempt to
>> mitigate the summing overhead is the wrong solution.
>>
>> Indeed, it may be that you need o use a custom batch scaling curve
>> for the generic per-cpu coutner infrastructure to mitigate the
>> update overhead, but the fact is we already have generic
>> infrastructure that solves your problem and so the solution should
>> be "use the generic infrastructure" until it can be proven not to
>> work.
>>
>> i.e. prove the generic infrastructure is not fit for purpose and
>> cannot be improved sufficiently to work for this use case before
>> implementing a complex, one-off snowflake counter implementation...
> Completely agreed! Apart from that I find that conversion to a generic
> infrastructure worth even if that doesn't solve the problem at hands
> completely. If for no other reasons then the sheer code removal as kstat
> is not really used for anything apart from this accounting AFAIR. The
> less ad-hoc code we have the better IMHO.
>
> And to the underlying problem. Some proc files do not scale on large
> machines. Maybe it is time to explain that to application writers that
> if they are collecting data too agressively then it won't scale. We can
> only do this much. Lying about numbers by hiding updates is, well,
> lying and won't solve the underlying problem. 

I would not say it is lying. As I said in the changelog, reading
/proc/stat infrequently will give the right counts. Only when it is
read frequently may the data not be up-to-date. Using
percpu_counter_read_positive() as suggested by Dave means the counts
will likely be off by a certain amount too. So it is also a trade-off
between accuracy and performance.

Cheers,
Longman


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-08 17:05         ` Waiman Long
@ 2019-01-08 17:32           ` Waiman Long
  0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2019-01-08 17:32 UTC (permalink / raw)
  To: Michal Hocko, Dave Chinner
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Randy Dunlap

On 01/08/2019 12:05 PM, Waiman Long wrote:
> On 01/08/2019 11:11 AM, Michal Hocko wrote:
>> On Tue 08-01-19 13:04:22, Dave Chinner wrote:
>>> On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
>>>> On 01/07/2019 05:32 PM, Dave Chinner wrote:
>>>>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
>>>>>> As newer systems have more and more IRQs and CPUs available in their
>>>>>> system, the performance of reading /proc/stat frequently is getting
>>>>>> worse and worse.
>>>>> Because the "roll-your-own" per-cpu counter implementaiton has been
>>>>> optimised for low possible addition overhead on the premise that
>>>>> summing the counters is rare and isn't a performance issue. This
>>>>> patchset is a direct indication that this "summing is rare and can
>>>>> be slow" premise is now invalid.
>>>>>
>>>>> We have percpu counter infrastructure that trades off a small amount
>>>>> of addition overhead for zero-cost reading of the counter value.
>>>>> i.e. why not just convert this whole mess to percpu_counters and
>>>>> then just use percpu_counter_read_positive()? Then we just don't
>>>>> care how often userspace reads the /proc file because there is no
>>>>> summing involved at all...
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dave.
>>>> Yes, percpu_counter_read_positive() is cheap. However, you still need to
>>>> pay the price somewhere. In the case of percpu_counter, the update is
>>>> more expensive.
>>> Ummm, that's exactly what I just said. It's a percpu counter that
>>> solves the "sum is expensive and frequent" problem, just like you
>>> are encountering here. I do not need basic scalability algorithms
>>> explained to me.
>>>
>>>> I would say the percentage of applications that will hit this problem is
>>>> small. But for them, this problem has some significant performance overhead.
>>> Well, duh!
>>>
>>> What I was suggesting is that you change the per-cpu counter
>>> implementation to the /generic infrastructure/ that solves this
>>> problem, and then determine if the extra update overhead is at all
>>> measurable. If you can't measure any difference in update overhead,
>>> then slapping complexity on the existing counter to attempt to
>>> mitigate the summing overhead is the wrong solution.
>>>
>>> Indeed, it may be that you need o use a custom batch scaling curve
>>> for the generic per-cpu coutner infrastructure to mitigate the
>>> update overhead, but the fact is we already have generic
>>> infrastructure that solves your problem and so the solution should
>>> be "use the generic infrastructure" until it can be proven not to
>>> work.
>>>
>>> i.e. prove the generic infrastructure is not fit for purpose and
>>> cannot be improved sufficiently to work for this use case before
>>> implementing a complex, one-off snowflake counter implementation...
>> Completely agreed! Apart from that I find that conversion to a generic
>> infrastructure worth even if that doesn't solve the problem at hands
>> completely. If for no other reasons then the sheer code removal as kstat
>> is not really used for anything apart from this accounting AFAIR. The
>> less ad-hoc code we have the better IMHO.

Another point that I want to make is that I don't see the kstat code
ever being removed unless we scrap the whole /proc/stat file. IRQ
count reporting is a performance problem simply because of the large
number of IRQs (in the thousands). The other percpu counts are
currently fine as they only go up to hundreds at most.

Cheers,
Longman
>> And to the underlying problem. Some proc files do not scale on large
>> machines. Maybe it is time to explain that to application writers that
>> if they are collecting data too agressively then it won't scale. We can
>> only do this much. Lying about numbers by hiding updates is, well,
>> lying and won't solve the underlying problem. 
> I would not say it is lying. As I said in the changelog, reading
> /proc/stat infrequently will give the right counts. Only when it is read
> frequently that the data may not be up-to-date. Using
> percpu_counter_sum_positive() as suggested by Dave will guarantee that
> the counts will likely be off by a certain amount too. So it is also a
> trade-off between accuracy and performance.
>
> Cheers,
> Longman
>
>


* Re: [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead
  2019-01-08 16:58       ` Waiman Long
@ 2019-01-08 22:27         ` Dave Chinner
  0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2019-01-08 22:27 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Jonathan Corbet, linux-kernel, linux-doc, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Randy Dunlap

On Tue, Jan 08, 2019 at 11:58:26AM -0500, Waiman Long wrote:
> On 01/07/2019 09:04 PM, Dave Chinner wrote:
> > On Mon, Jan 07, 2019 at 05:41:39PM -0500, Waiman Long wrote:
> >> On 01/07/2019 05:32 PM, Dave Chinner wrote:
> >>> On Mon, Jan 07, 2019 at 10:12:56AM -0500, Waiman Long wrote:
> > What I was suggesting is that you change the per-cpu counter
> > implementation to the /generic infrastructure/ that solves this
> > problem, and then determine if the extra update overhead is at all
> > measurable. If you can't measure any difference in update overhead,
> > then slapping complexity on the existing counter to attempt to
> > mitigate the summing overhead is the wrong solution.
> >
> > Indeed, it may be that you need o use a custom batch scaling curve
> > for the generic per-cpu coutner infrastructure to mitigate the
> > update overhead, but the fact is we already have generic
> > infrastructure that solves your problem and so the solution should
> > be "use the generic infrastructure" until it can be proven not to
> > work.
> >
> > i.e. prove the generic infrastructure is not fit for purpose and
> > cannot be improved sufficiently to work for this use case before
> > implementing a complex, one-off snowflake counter implementation...
> 
> I see your point. I like the deferred summation approach that I am
> currently using. If I have to modify the current per-cpu counter
> implementation to support that

No! Stop that already. The "deferred counter summation" is exactly
the problem the current algorithm has and exactly the problem the
generic counters /don't have/. Changing the generic percpu counter
algorithm to match this specific hand-rolled implementation is not
desirable as it will break implementations that rely on the bounded
maximum summation deviation of the existing algorithm (e.g. the ext4
and XFS ENOSPC accounting algorithms).

> and I probably need to add counter
> grouping support to amortize the overhead, that can be a major

The per-cpu counters already have configurable update batching
to amortise the global sum update cost across multiple individual
per-cpu updates. You don't need to change the implementation at all,
just tweak the amortisation curve appropriately for the desired balance
of update scalability (high) vs read accuracy (low).

Let's face it, on large systems where counters are frequently
updated, the resultant sum can be highly inaccurate by the time a
thousand CPU counters have been summed. The generic counters have a
predictable and bounded "fast sum" maximum deviation (batch size *
nr_cpus), so on large machines they are likely to be much more accurate
than "unlocked summing on demand" algorithms.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LKP] [/proc/stat] 3047027b34: reaim.jobs_per_min -4.8% regression
       [not found]   ` <20190118084456.GA10690@shao2-debian>
@ 2019-01-21 20:02     ` Kees Cook
  2019-01-21 21:25       ` Alexey Dobriyan
  0 siblings, 1 reply; 22+ messages in thread
From: Kees Cook @ 2019-01-21 20:02 UTC (permalink / raw)
  To: kernel test robot, Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Luis Chamberlain,
	Jonathan Corbet, LKML, open list:DOCUMENTATION, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap, LKP

On Fri, Jan 18, 2019 at 9:44 PM kernel test robot <rong.a.chen@intel.com> wrote:
>
> Greeting,
>
> FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
>
>
> commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency")
> url: https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818

Is this expected? (And it seems like other things in the report below
are faster? I don't understand why this particular regression was
called out?)

-Kees

>
>
> in testcase: reaim
> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory
> with following parameters:
>
>         runtime: 300s
>         nr_task: 5000
>         test: shared_memory
>         cpufreq_governor: performance
>         ucode: 0x3d
>
> test-description: REAIM is an updated and improved version of AIM 7 benchmark.
> test-url: https://sourceforge.net/projects/re-aim-7/
>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
>   gcc-7/performance/x86_64-rhel-7.2/5000/debian-x86_64-2018-04-03.cgz/300s/lkp-hsw-ep5/shared_memory/reaim/0x3d
>
> commit:
>   51e8bce392 ("/proc/stat: Extract irqs counting code into show_stat_irqs()")
>   3047027b34 ("/proc/stat: Add sysctl parameter to control irq counts latency")
>
> 51e8bce392dd2cc9 3047027b34b8c6404b50990305
> ---------------- --------------------------
>        fail:runs  %reproduction    fail:runs
>            |             |             |
>           1:4          -25%            :4     kmsg.igb#:#:#:exceed_max#second
>          %stddev     %change         %stddev
>              \          |                \
>     101.96            +7.5%     109.60        reaim.child_systime
>      33.32            -1.8%      32.73        reaim.child_utime
>    5534451            -4.8%    5271308        reaim.jobs_per_min
>       1106            -4.8%       1054        reaim.jobs_per_min_child
>    5800927            -4.9%    5517884        reaim.max_jobs_per_min
>       5.42            +5.0%       5.69        reaim.parent_time
>       1.51            +5.3%       1.59        reaim.std_dev_time
>   29374932            -2.8%   28558608        reaim.time.minor_page_faults
>       1681            +1.6%       1708        reaim.time.percent_of_cpu_this_job_got
>       3841            +4.5%       4012        reaim.time.system_time
>       1234            -4.4%       1180        reaim.time.user_time
>   18500000            -2.7%   18000000        reaim.workload
>    5495296 ±  9%      -9.5%    4970496        meminfo.DirectMap2M
>       5142 ± 18%     -43.2%       2920 ± 46%  numa-vmstat.node0.nr_shmem
>      29.00 ± 32%     +56.9%      45.50 ± 10%  vmstat.procs.r
>      67175 ± 37%     +66.6%     111910 ± 20%  numa-meminfo.node0.AnonHugePages
>      20591 ± 18%     -43.2%      11691 ± 46%  numa-meminfo.node0.Shmem
>      64688 ±  6%     -36.8%      40906 ± 19%  slabinfo.kmalloc-8.active_objs
>      64691 ±  6%     -36.8%      40908 ± 19%  slabinfo.kmalloc-8.num_objs
>      37.36 ±  7%     +11.1%      41.53 ±  4%  boot-time.boot
>      29.15 ±  6%     +14.3%      33.31 ±  3%  boot-time.dhcp
>     847.73 ±  9%     +12.9%     957.09 ±  4%  boot-time.idle
>     202.50 ±100%    +101.7%     408.50        proc-vmstat.nr_mlock
>       8018 ±  9%     -12.3%       7034 ±  2%  proc-vmstat.nr_shmem
>   29175944            -2.8%   28369676        proc-vmstat.numa_hit
>   29170351            -2.8%   28364111        proc-vmstat.numa_local
>       5439 ±  5%     -18.7%       4423 ±  7%  proc-vmstat.pgactivate
>   30220220            -2.8%   29374906        proc-vmstat.pgalloc_normal
>   30182224            -2.7%   29368266        proc-vmstat.pgfault
>   30186671            -2.8%   29341792        proc-vmstat.pgfree
>      69510 ± 12%     -34.2%      45759 ± 33%  sched_debug.cfs_rq:/.load.avg
>      30.21 ± 24%     -33.6%      20.05 ± 20%  sched_debug.cfs_rq:/.runnable_load_avg.avg
>      66447 ± 12%     -37.6%      41460 ± 37%  sched_debug.cfs_rq:/.runnable_weight.avg
>      12.35 ±  4%     +88.0%      23.22 ± 15%  sched_debug.cpu.clock.stddev
>      12.35 ±  4%     +88.0%      23.22 ± 15%  sched_debug.cpu.clock_task.stddev
>      30.06 ± 12%     -26.5%      22.10 ± 13%  sched_debug.cpu.cpu_load[0].avg
>      29.37 ±  9%     -22.6%      22.72 ± 13%  sched_debug.cpu.cpu_load[1].avg
>      28.71 ±  6%     -21.1%      22.66 ± 16%  sched_debug.cpu.cpu_load[2].avg
>      17985           -12.0%      15823 ±  2%  sched_debug.cpu.curr->pid.max
>      67478 ±  6%     -32.5%      45531 ± 24%  sched_debug.cpu.load.avg
>      10369 ± 49%    -100.0%       0.00        sched_debug.cpu.load.min
>       0.21 ± 34%    -100.0%       0.00        sched_debug.cpu.nr_running.min
>      12.98           -16.3%      10.86 ± 11%  perf-stat.i.MPKI
>  5.712e+09            -3.8%  5.492e+09        perf-stat.i.branch-instructions
>  1.024e+08            -3.7%   98557208        perf-stat.i.branch-misses
>       8.17            +0.4        8.58 ±  2%  perf-stat.i.cache-miss-rate%
>    7839589           +10.5%    8659798        perf-stat.i.cache-misses
>   86324420            +3.8%   89595898 ±  2%  perf-stat.i.cache-references
>       1.55 ±  2%      -4.0%       1.49        perf-stat.i.cpi
>       2290            -1.9%       2246        perf-stat.i.cpu-migrations
>       4667           -10.9%       4160        perf-stat.i.cycles-between-cache-misses
>  8.749e+09            -3.9%  8.409e+09        perf-stat.i.dTLB-loads
>     527660 ±  3%     -15.0%     448539 ±  5%  perf-stat.i.dTLB-store-misses
>  5.747e+09            -4.3%  5.499e+09        perf-stat.i.dTLB-stores
>   53047071            -3.5%   51190942        perf-stat.i.iTLB-load-misses
>   20576112           -13.7%   17759009 ±  2%  perf-stat.i.iTLB-loads
>  3.207e+10            -3.9%  3.083e+10        perf-stat.i.instructions
>       0.77            -2.3%       0.75        perf-stat.i.ipc
>      99933            -3.8%      96127        perf-stat.i.minor-faults
>    4325719            +5.6%    4568226        perf-stat.i.node-load-misses
>      52.39            -2.0       50.36        perf-stat.i.node-store-miss-rate%
>    1411700           +20.9%    1706321        perf-stat.i.node-store-misses
>     883790           +34.1%    1184836        perf-stat.i.node-stores
>      99933            -3.8%      96127        perf-stat.i.page-faults
>       2.69            +7.9%       2.91        perf-stat.overall.MPKI
>       9.08            +0.6        9.67 ±  2%  perf-stat.overall.cache-miss-rate%
>       1.13            +5.0%       1.19        perf-stat.overall.cpi
>       4633            -8.6%       4233        perf-stat.overall.cycles-between-cache-misses
>       0.01 ±  2%      -0.0        0.01 ±  4%  perf-stat.overall.dTLB-store-miss-rate%
>      72.05            +2.2       74.24        perf-stat.overall.iTLB-load-miss-rate%
>       0.88            -4.8%       0.84        perf-stat.overall.ipc
>      78.12            +1.4       79.52        perf-stat.overall.node-load-miss-rate%
>      61.49            -2.5       59.01        perf-stat.overall.node-store-miss-rate%
>  5.688e+09            -3.8%  5.471e+09        perf-stat.ps.branch-instructions
>   1.02e+08            -3.7%   98177745        perf-stat.ps.branch-misses
>    7807912           +10.5%    8626353        perf-stat.ps.cache-misses
>   85999504            +3.8%   89266688 ±  2%  perf-stat.ps.cache-references
>       2282            -1.9%       2239        perf-stat.ps.cpu-migrations
>  8.713e+09            -3.9%  8.376e+09        perf-stat.ps.dTLB-loads
>     525761 ±  3%     -15.0%     446967 ±  5%  perf-stat.ps.dTLB-store-misses
>  5.723e+09            -4.3%  5.478e+09        perf-stat.ps.dTLB-stores
>   52823322            -3.5%   50990190        perf-stat.ps.iTLB-load-misses
>   20490261           -13.7%   17689923 ±  2%  perf-stat.ps.iTLB-loads
>  3.193e+10            -3.8%  3.071e+10        perf-stat.ps.instructions
>      99560            -3.8%      95786        perf-stat.ps.minor-faults
>    4308031            +5.6%    4550453        perf-stat.ps.node-load-misses
>    1405805           +20.9%    1699670        perf-stat.ps.node-store-misses
>     880319           +34.1%    1180378        perf-stat.ps.node-stores
>      99560            -3.8%      95786        perf-stat.ps.page-faults
>  9.664e+12            -2.8%  9.397e+12        perf-stat.total.instructions
>      33.09 ±  7%     -25.0        8.10 ± 45%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
>      32.21 ±  7%     -24.3        7.92 ± 45%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      22.36 ±  7%     -17.1        5.24 ± 49%  perf-profile.calltrace.cycles-pp.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      10.57 ±  5%      -8.3        2.31 ± 48%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
>       9.60 ±  4%      -7.6        2.03 ± 49%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret
>       3.68 ±  8%      -3.3        0.43 ±102%  perf-profile.calltrace.cycles-pp.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       4.37 ±  9%      -2.8        1.60 ± 50%  perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       2.30 ±  2%      -1.9        0.44 ±102%  perf-profile.calltrace.cycles-pp.security_sem_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       5.10 ±  3%      -1.6        3.46 ± 10%  perf-profile.calltrace.cycles-pp.ipc_has_perm.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       3.20            -1.3        1.95 ± 23%  perf-profile.calltrace.cycles-pp.avc_has_perm.ipc_has_perm.security_ipc_permission.do_semtimedop.do_syscall_64
>       2.12 ±  3%      -0.4        1.76 ±  4%  perf-profile.calltrace.cycles-pp.avc_has_perm.ipc_has_perm.security_sem_semop.do_semtimedop.do_syscall_64
>       1.03 ±  2%      -0.1        0.94 ±  3%  perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault
>       1.06 ±  2%      -0.1        0.97 ±  3%  perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
>       1.06 ±  2%      -0.1        0.98 ±  3%  perf-profile.calltrace.cycles-pp.page_fault
>       0.91 ±  3%      -0.1        0.82 ±  3%  perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>       0.86 ±  3%      -0.1        0.78 ±  3%  perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
>       0.61 ± 12%      +0.2        0.79 ± 12%  perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry
>       0.61 ± 10%      +0.2        0.79 ± 10%  perf-profile.calltrace.cycles-pp.shm_close.remove_vma.__do_munmap.ksys_shmdt.do_syscall_64
>       0.67 ± 11%      +0.2        0.86 ± 10%  perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
>       0.40 ± 57%      +0.2        0.62 ±  8%  perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.do_shmat.__x64_sys_shmat.do_syscall_64
>       0.40 ± 57%      +0.2        0.62 ±  7%  perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.do_shmat.__x64_sys_shmat
>       0.43 ± 57%      +0.2        0.66 ±  7%  perf-profile.calltrace.cycles-pp.down_write.do_shmat.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.41 ± 57%      +0.3        0.67 ±  9%  perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.shm_close.remove_vma.__do_munmap
>       0.40 ± 57%      +0.3        0.66 ±  7%  perf-profile.calltrace.cycles-pp.down_write.ipcget.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.41 ± 57%      +0.3        0.67 ±  9%  perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.shm_close.remove_vma
>       0.44 ± 57%      +0.3        0.71 ±  9%  perf-profile.calltrace.cycles-pp.down_write.shm_close.remove_vma.__do_munmap.ksys_shmdt
>       0.26 ±100%      +0.4        0.67 ±  8%  perf-profile.calltrace.cycles-pp.dput.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64
>       0.28 ±100%      +0.4        0.71 ±  7%  perf-profile.calltrace.cycles-pp.__might_fault._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.58 ± 10%      +0.4        1.03 ±  6%  perf-profile.calltrace.cycles-pp.shmctl_down.ksys_shmctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.59 ± 10%      +0.5        1.05 ±  6%  perf-profile.calltrace.cycles-pp.ksys_shmctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.13 ±173%      +0.5        0.62 ±  7%  perf-profile.calltrace.cycles-pp.call_rwsem_down_write_failed.down_write.ipcget.__x64_sys_shmget.do_syscall_64
>       0.13 ±173%      +0.5        0.62 ±  7%  perf-profile.calltrace.cycles-pp.rwsem_down_write_failed.call_rwsem_down_write_failed.down_write.ipcget.__x64_sys_shmget
>       0.12 ±173%      +0.5        0.66 ±  8%  perf-profile.calltrace.cycles-pp.dentry_kill.dput.__fput.task_work_run.exit_to_usermode_loop
>       0.00            +0.6        0.58 ±  7%  perf-profile.calltrace.cycles-pp.copy_user_generic_unrolled._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.6        0.60 ±  7%  perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.6        0.61 ±  7%  perf-profile.calltrace.cycles-pp.wait_consider_task.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
>       0.00            +0.6        0.62 ±  4%  perf-profile.calltrace.cycles-pp.semctl_down.ksys_semctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
>       0.75 ±  8%      +0.6        1.38 ±  8%  perf-profile.calltrace.cycles-pp.__x64_sys_semop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.00            +0.6        0.65 ±  9%  perf-profile.calltrace.cycles-pp.ipcperms.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.00            +0.6        0.65 ±  8%  perf-profile.calltrace.cycles-pp.__dentry_kill.dentry_kill.dput.__fput.task_work_run
>       0.89 ± 11%      +0.7        1.57 ±  8%  perf-profile.calltrace.cycles-pp.do_shmat.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
>       0.90 ± 11%      +0.7        1.58 ±  8%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
>       0.90 ± 10%      +0.7        1.58 ±  8%  perf-profile.calltrace.cycles-pp.__x64_sys_shmat.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmat
>       0.00            +0.7        0.68 ±  5%  perf-profile.calltrace.cycles-pp.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.wait
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
>       0.00            +0.7        0.69 ±  6%  perf-profile.calltrace.cycles-pp.kernel_wait4.__do_sys_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe.wait
>       0.90 ± 11%      +0.7        1.59 ±  8%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmat
>       0.00            +0.7        0.69 ±  5%  perf-profile.calltrace.cycles-pp.wait
>       0.27 ±100%      +0.7        0.99 ±  7%  perf-profile.calltrace.cycles-pp.perform_atomic_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.95 ± 11%      +0.7        1.67 ±  8%  perf-profile.calltrace.cycles-pp.shmat
>       0.00            +0.7        0.73 ±  6%  perf-profile.calltrace.cycles-pp.ipcget.ksys_semget.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
>       0.00            +0.7        0.74 ±  8%  perf-profile.calltrace.cycles-pp.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.72 ±  9%      +0.7        1.46 ±  7%  perf-profile.calltrace.cycles-pp._raw_spin_lock.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       0.00            +0.8        0.75 ±  5%  perf-profile.calltrace.cycles-pp.ksys_semget.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
>       0.00            +0.8        0.76 ±  8%  perf-profile.calltrace.cycles-pp.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.89 ±  9%      +0.8        1.65 ±  9%  perf-profile.calltrace.cycles-pp.__do_munmap.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
>       0.00            +0.8        0.76 ±  5%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semget
>       0.00            +0.8        0.76 ±  5%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semget
>       0.90 ±  9%      +0.8        1.68 ±  9%  perf-profile.calltrace.cycles-pp.ksys_shmdt.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
>       0.00            +0.8        0.78 ±  6%  perf-profile.calltrace.cycles-pp.ksys_semctl.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
>       0.00            +0.8        0.80 ±  5%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semctl
>       0.00            +0.8        0.80 ±  5%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semctl
>       0.00            +0.8        0.85 ±  5%  perf-profile.calltrace.cycles-pp.semget
>       0.97 ±  9%      +0.8        1.82 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmctl
>       0.97 ±  9%      +0.8        1.82 ±  7%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmctl
>       0.94 ± 10%      +0.9        1.79 ±  6%  perf-profile.calltrace.cycles-pp._copy_from_user.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.04 ±  9%      +0.9        1.91 ±  6%  perf-profile.calltrace.cycles-pp.ipcget.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
>       1.05 ±  9%      +0.9        1.92 ±  5%  perf-profile.calltrace.cycles-pp.__x64_sys_shmget.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
>       1.05 ±  9%      +0.9        1.93 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmget
>       1.05 ±  9%      +0.9        1.93 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmget
>       1.02 ±  9%      +0.9        1.92 ±  7%  perf-profile.calltrace.cycles-pp.shmctl
>       1.00 ±  9%      +0.9        1.93 ±  9%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.shmdt
>       1.00 ±  8%      +0.9        1.94 ±  9%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.shmdt
>       1.13 ±  8%      +0.9        2.08 ±  6%  perf-profile.calltrace.cycles-pp.shmget
>       0.00            +1.0        0.97 ±  5%  perf-profile.calltrace.cycles-pp.semctl
>       1.03 ±  9%      +1.0        1.99 ±  5%  perf-profile.calltrace.cycles-pp.do_smart_update.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.04 ±  8%      +1.0        2.03 ±  9%  perf-profile.calltrace.cycles-pp.shmdt
>       2.61 ±  9%      +1.3        3.89 ± 14%  perf-profile.calltrace.cycles-pp.security_ipc_permission.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.65 ± 13%      +2.0        3.65 ±  7%  perf-profile.calltrace.cycles-pp.security_sem_semop.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       1.61 ±  6%      +5.3        6.89 ±  2%  perf-profile.calltrace.cycles-pp.idr_find.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       6.45 ±  9%      +6.2       12.63 ±  6%  perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.semop
>       2.92 ±  5%      +8.0       10.88 ±  7%  perf-profile.calltrace.cycles-pp.ipc_obtain_object_check.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>       9.07 ±  8%      +8.8       17.85 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.semop
>      15.56 ±  8%     +18.8       34.38 ±  6%  perf-profile.calltrace.cycles-pp.do_semtimedop.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>      17.40 ±  8%     +20.4       37.82 ±  6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.semop
>      17.99 ±  8%     +21.0       39.00 ±  6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.semop
>      33.85 ±  8%     +36.1       69.98 ±  6%  perf-profile.calltrace.cycles-pp.semop
>
>
>
>                       reaim.time.percent_of_cpu_this_job_got
>
>   1800 +-+------------------------------------------------------------------+
>        |         O                                                          |
>   1780 +-OO O O     O                                                       |
>   1760 O-+     O   O  OO                                                    |
>        |                                                                    |
>   1740 +-+                                                                  |
>        |                                                                    |
>   1720 +-+                  O                       .+.                     |
>        |                   O  O                    +   +                    |
>   1700 +-+               O                       .+    :                    |
>   1680 +-+      .+      .+.    .+ .+.+         .+       :         +.  .+.++.|
>        |.++.+.++  :  .++   ++.+  +    +.+.++.++         +.+.++.+.+  ++      |
>   1660 +-+        : +                                                       |
>        |           +                                                        |
>   1640 +-+------------------------------------------------------------------+
>
>
> [*] bisect-good sample
> [O] bisect-bad  sample
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> Thanks,
> Rong Chen



-- 
Kees Cook

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [LKP] [/proc/stat] 3047027b34: reaim.jobs_per_min -4.8% regression
  2019-01-21 20:02     ` [LKP] [/proc/stat] 3047027b34: reaim.jobs_per_min -4.8% regression Kees Cook
@ 2019-01-21 21:25       ` Alexey Dobriyan
  0 siblings, 0 replies; 22+ messages in thread
From: Alexey Dobriyan @ 2019-01-21 21:25 UTC (permalink / raw)
  To: Kees Cook
  Cc: kernel test robot, Waiman Long, Andrew Morton, Luis Chamberlain,
	Jonathan Corbet, LKML, open list:DOCUMENTATION, linux-fsdevel,
	Davidlohr Bueso, Miklos Szeredi, Daniel Colascione, Dave Chinner,
	Randy Dunlap, LKP

On Tue, Jan 22, 2019 at 09:02:53AM +1300, Kees Cook wrote:
> On Fri, Jan 18, 2019 at 9:44 PM kernel test robot <rong.a.chen@intel.com> wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -4.8% regression of reaim.jobs_per_min due to commit:
> >
> >
> > commit: 3047027b34b8c6404b509903058b89836093acc7 ("[PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency")
> > url: https://github.com/0day-ci/linux/commits/Waiman-Long/proc-stat-Reduce-irqs-counting-performance-overhead/20190108-104818
> 
> Is this expected? (And it seems like other things in the report below
> are faster? I don't understand why this particular regression was
> called out?)

No, but the sysctl has been dropped, so the point is moot.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-01-21 21:25 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-07 15:12 [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Waiman Long
2019-01-07 15:12 ` [PATCH 1/2] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
2019-01-07 21:42   ` Kees Cook
2019-01-07 15:12 ` [PATCH 2/2] /proc/stat: Add sysctl parameter to control irq counts latency Waiman Long
2019-01-07 15:58   ` Matthew Wilcox
2019-01-07 16:07     ` Waiman Long
2019-01-07 16:14       ` Matthew Wilcox
2019-01-07 16:19         ` Waiman Long
2019-01-07 16:33     ` Alexey Dobriyan
2019-01-07 16:59       ` Waiman Long
     [not found]   ` <20190118084456.GA10690@shao2-debian>
2019-01-21 20:02     ` [LKP] [/proc/stat] 3047027b34: reaim.jobs_per_min -4.8% regression Kees Cook
2019-01-21 21:25       ` Alexey Dobriyan
2019-01-07 22:32 ` [PATCH 0/2] /proc/stat: Reduce irqs counting performance overhead Dave Chinner
2019-01-07 22:41   ` Daniel Colascione
2019-01-07 23:49     ` Alexey Dobriyan
2019-01-07 22:41   ` Waiman Long
2019-01-08  2:04     ` Dave Chinner
2019-01-08 16:11       ` Michal Hocko
2019-01-08 17:05         ` Waiman Long
2019-01-08 17:32           ` Waiman Long
2019-01-08 16:58       ` Waiman Long
2019-01-08 22:27         ` Dave Chinner
