linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
@ 2019-01-09 17:23 Waiman Long
  2019-01-09 17:23 ` [PATCH v2 1/4] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 17:23 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner
  Cc: linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap, Matthew Wilcox,
	Waiman Long

v1: https://lkml.org/lkml/2019/1/7/899

As newer systems have more and more IRQs and CPUs available, the
performance of reading /proc/stat frequently gets worse and worse.

It appears that the v1 patch's idea of caching the IRQ counts to reduce
the frequency of percpu summation, with a sysctl parameter to control
it, was not well received.

This v2 patch optimizes the way the IRQ counts are retrieved and
computed and gets rid of the sysctl parameter altogether, achieving a
performance gain close to that of the v1 patch. It is based on the
observation that while a system can support many IRQs, only a handful
of them are actually used in most cases. We can save a lot of time by
focusing on those active IRQs only and ignoring the rest.
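
As an illustration of that reasoning, a toy user-space model of the
difference (the IRQ and active counts are made-up numbers): a reader
that knows which slots are active touches far fewer counters than one
that walks every provisioned slot.

#include <stdio.h>

#define NR_IRQS   3000	/* hypothetical: IRQ slots provisioned */
#define NR_ACTIVE 75	/* hypothetical: slots that ever fired */

static unsigned long counts[NR_IRQS];
static int active[NR_ACTIVE];	/* indices of IRQs seen at least once */

/* Baseline: walk every possible IRQ slot, mostly adding zeros. */
static unsigned long sum_all(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NR_IRQS; i++)
		sum += counts[i];
	return sum;
}

/* Optimized: walk only the slots known to be active. */
static unsigned long sum_active(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NR_ACTIVE; i++)
		sum += counts[active[i]];
	return sum;
}

int main(void)
{
	printf("%lu %lu\n", sum_all(), sum_active());
	return 0;
}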

Patch 1 is the same as in v1, while the other three patches are new.

Waiman Long (4):
  /proc/stat: Extract irqs counting code into show_stat_irqs()
  /proc/stat: Only do percpu sum of active IRQs
  genirq: Track the number of active IRQs
  /proc/stat: Call kstat_irqs_usr() only for active IRQs

 fs/proc/stat.c          | 123 ++++++++++++++++++++++++++++++++++++++++++++----
 include/linux/irqdesc.h |   1 +
 kernel/irq/internals.h  |   6 ++-
 kernel/irq/irqdesc.c    |   7 ++-
 4 files changed, 125 insertions(+), 12 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 1/4] /proc/stat: Extract irqs counting code into show_stat_irqs()
  2019-01-09 17:23 [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Waiman Long
@ 2019-01-09 17:23 ` Waiman Long
  2019-01-09 17:23 ` [PATCH v2 2/4] /proc/stat: Only do percpu sum of active IRQs Waiman Long
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 17:23 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner
  Cc: linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap, Matthew Wilcox,
	Waiman Long

The code that generates the "intr" line of /proc/stat is moved from
show_stat() into a new function, show_stat_irqs(). There is no
functional change.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/proc/stat.c | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 535eda7..4b06f1b 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -79,12 +79,38 @@ static u64 get_iowait_time(int cpu)
 
 #endif
 
+static u64 compute_stat_irqs_sum(void)
+{
+	int i;
+	u64 sum = 0;
+
+	for_each_possible_cpu(i) {
+		sum += kstat_cpu_irqs_sum(i);
+		sum += arch_irq_stat_cpu(i);
+	}
+	sum += arch_irq_stat();
+	return sum;
+}
+
+/*
+ * Print out the "intr" line of /proc/stat.
+ */
+static void show_stat_irqs(struct seq_file *p)
+{
+	int i;
+
+	seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
+	for_each_irq_nr(i)
+		seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
+
+	seq_putc(p, '\n');
+}
+
 static int show_stat(struct seq_file *p, void *v)
 {
 	int i, j;
 	u64 user, nice, system, idle, iowait, irq, softirq, steal;
 	u64 guest, guest_nice;
-	u64 sum = 0;
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
 	struct timespec64 boottime;
@@ -105,8 +131,6 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
-		sum += kstat_cpu_irqs_sum(i);
-		sum += arch_irq_stat_cpu(i);
 
 		for (j = 0; j < NR_SOFTIRQS; j++) {
 			unsigned int softirq_stat = kstat_softirqs_cpu(j, i);
@@ -115,7 +139,6 @@ static int show_stat(struct seq_file *p, void *v)
 			sum_softirq += softirq_stat;
 		}
 	}
-	sum += arch_irq_stat();
 
 	seq_put_decimal_ull(p, "cpu  ", nsec_to_clock_t(user));
 	seq_put_decimal_ull(p, " ", nsec_to_clock_t(nice));
@@ -154,14 +177,10 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, " ", nsec_to_clock_t(guest_nice));
 		seq_putc(p, '\n');
 	}
-	seq_put_decimal_ull(p, "intr ", (unsigned long long)sum);
-
-	/* sum again ? it could be updated? */
-	for_each_irq_nr(j)
-		seq_put_decimal_ull(p, " ", kstat_irqs_usr(j));
+	show_stat_irqs(p);
 
 	seq_printf(p,
-		"\nctxt %llu\n"
+		"ctxt %llu\n"
 		"btime %llu\n"
 		"processes %lu\n"
 		"procs_running %lu\n"
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 2/4] /proc/stat: Only do percpu sum of active IRQs
  2019-01-09 17:23 [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Waiman Long
  2019-01-09 17:23 ` [PATCH v2 1/4] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
@ 2019-01-09 17:23 ` Waiman Long
  2019-01-09 17:23 ` [PATCH v2 3/4] genirq: Track the number " Waiman Long
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 17:23 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner
  Cc: linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap, Matthew Wilcox,
	Waiman Long

Recent computer systems may have hundreds or even thousands of IRQs
available. However, most of them may not be active and their IRQ counts
are zero. It is just a waste of CPU cycles to do percpu summation of
those zero counts.

In order to find out if an IRQ is active, we track the transition of the
percpu count from 0 to 1 and atomically increment a new kstat_irq_cpus
counter which counts the number of CPUs that handle this particular IRQ.

The IRQ descriptor is zalloc'ed, so there is no need to initialize the
new counter.

On a 4-socket Broadwell server with 112 vCPUs and 2952 IRQs (2877 of
them with a zero count), the system time needed to read /proc/stat 50k
times was reduced from 11.200s to 8.048s. That is an execution time
reduction of 28%.
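
A minimal user-space reader that approximates this workload could be as
simple as the following (the loop count and buffer size are arbitrary):

/* gcc -O2 readstat.c -o readstat && time ./readstat 50000 */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[1 << 16];
	long i, loops = argc > 1 ? atol(argv[1]) : 50000;

	for (i = 0; i < loops; i++) {
		int fd = open("/proc/stat", O_RDONLY);

		if (fd < 0)
			return 1;
		while (read(fd, buf, sizeof(buf)) > 0)
			;
		close(fd);
	}
	return 0;
}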

Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/irqdesc.h | 1 +
 kernel/irq/internals.h  | 3 ++-
 kernel/irq/irqdesc.c    | 2 +-
 3 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index dd1e40d..86bbad2 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -61,6 +61,7 @@ struct irq_desc {
 	irq_preflow_handler_t	preflow_handler;
 #endif
 	struct irqaction	*action;	/* IRQ action list */
+	atomic_t		kstat_irq_cpus;	/* #cpus handling this IRQ */
 	unsigned int		status_use_accessors;
 	unsigned int		core_internal_state__do_not_mess_with_it;
 	unsigned int		depth;		/* nested irq disables */
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index ca6afa2..31787c1 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -244,7 +244,8 @@ static inline void irq_state_set_masked(struct irq_desc *desc)
 
 static inline void kstat_incr_irqs_this_cpu(struct irq_desc *desc)
 {
-	__this_cpu_inc(*desc->kstat_irqs);
+	if (unlikely(__this_cpu_inc_return(*desc->kstat_irqs) == 1))
+		atomic_inc(&desc->kstat_irq_cpus);
 	__this_cpu_inc(kstat.irqs_sum);
 }
 
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index ee062b7..3d2c38b 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -922,7 +922,7 @@ unsigned int kstat_irqs(unsigned int irq)
 	int cpu;
 	unsigned int sum = 0;
 
-	if (!desc || !desc->kstat_irqs)
+	if (!desc || !desc->kstat_irqs || !atomic_read(&desc->kstat_irq_cpus))
 		return 0;
 	for_each_possible_cpu(cpu)
 		sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 3/4] genirq: Track the number of active IRQs
  2019-01-09 17:23 [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Waiman Long
  2019-01-09 17:23 ` [PATCH v2 1/4] /proc/stat: Extract irqs counting code into show_stat_irqs() Waiman Long
  2019-01-09 17:23 ` [PATCH v2 2/4] /proc/stat: Only do percpu sum of active IRQs Waiman Long
@ 2019-01-09 17:23 ` Waiman Long
  2019-01-09 17:23 ` [PATCH v2 4/4] /proc/stat: Call kstat_irqs_usr() only for " Waiman Long
  2019-01-09 17:44 ` [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Matthew Wilcox
  4 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 17:23 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner
  Cc: linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap, Matthew Wilcox,
	Waiman Long

A new atomic variable, nr_active_irqs, is added to track the number of
active IRQs, i.e. those that have serviced one or more interrupts.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 kernel/irq/internals.h | 7 +++++--
 kernel/irq/irqdesc.c   | 5 +++++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 31787c1..239ae51 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -242,10 +242,13 @@ static inline void irq_state_set_masked(struct irq_desc *desc)
 
 #undef __irqd_to_state
 
+extern atomic_t nr_active_irqs;	/* # of active IRQs */
 static inline void kstat_incr_irqs_this_cpu(struct irq_desc *desc)
 {
-	if (unlikely(__this_cpu_inc_return(*desc->kstat_irqs) == 1))
-		atomic_inc(&desc->kstat_irq_cpus);
+	if (unlikely(__this_cpu_inc_return(*desc->kstat_irqs) == 1)) {
+		if (atomic_inc_return(&desc->kstat_irq_cpus) == 1)
+			atomic_inc(&nr_active_irqs);
+	}
 	__this_cpu_inc(kstat.irqs_sum);
 }
 
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 3d2c38b..7e00c4d 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -20,6 +20,11 @@
 #include "internals.h"
 
 /*
+ * Number of active IRQs
+ */
+atomic_t nr_active_irqs;
+
+/*
  * lockdep: we want to handle all irq_desc locks as a single lock-class:
  */
 static struct lock_class_key irq_desc_lock_class;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v2 4/4] /proc/stat: Call kstat_irqs_usr() only for active IRQs
  2019-01-09 17:23 [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Waiman Long
                   ` (2 preceding siblings ...)
  2019-01-09 17:23 ` [PATCH v2 3/4] genirq: Track the number " Waiman Long
@ 2019-01-09 17:23 ` Waiman Long
  2019-01-09 17:39   ` Waiman Long
  2019-01-09 17:44 ` [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Matthew Wilcox
  4 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2019-01-09 17:23 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner
  Cc: linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap, Matthew Wilcox,
	Waiman Long

After skipping the percpu summation of non-active IRQs on a 4-socket
Broadwell system with about 3k IRQs, about half of the CPU cycles were
spent in the kstat_irqs() call, the majority of which were used to look
up the IRQ descriptors for the corresponding IRQ numbers.
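
For context, each per-IRQ query first has to translate the IRQ number
into its descriptor before the per-CPU counts can be summed; with
CONFIG_SPARSE_IRQ that translation is a radix-tree lookup. A simplified
view of the path (details vary by kernel version):

struct irq_desc *irq_to_desc(unsigned int irq)
{
	return radix_tree_lookup(&irq_desc_tree, irq);
}

unsigned int kstat_irqs_usr(unsigned int irq)
{
	unsigned int sum;

	irq_lock_sparse();
	sum = kstat_irqs(irq);		/* irq_to_desc() + percpu summation */
	irq_unlock_sparse();
	return sum;
}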

We can recoup many of those lost cycles by calling kstat_irqs_usr()
only for those IRQs that are active. A bitmap is now used to keep track
of the active IRQs. A change in the nr_active_irqs count will cause the
code to rescan all the IRQs and repopulate the bitmap.

On the same 4-socket server, the introduction of this patch further
reduces the system time of reading /proc/stat 50k times from 8.048s
to 5.817s. That is another time reduction of 28%.

Signed-off-by: Waiman Long <longman@redhat.com>
---
 fs/proc/stat.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 4b06f1b..5e2a398 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -93,6 +93,25 @@ static u64 compute_stat_irqs_sum(void)
 }
 
 /*
+ * Write the given number of space separated '0' into the sequence file.
+ */
+static void write_zeros(struct seq_file *p, int cnt)
+{
+	/* String of 16 '0's */
+	static const char zeros[] = " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0";
+
+	while (cnt > 0) {
+		if (cnt >= 16) {
+			seq_write(p, zeros, 32);
+			cnt -= 16;
+		} else {
+			seq_write(p, zeros, 2 * cnt);
+			cnt = 0;
+		}
+	}
+}
+
+/*
  * Print out the "intr" line of /proc/stat.
  */
 static void show_stat_irqs(struct seq_file *p)
@@ -100,9 +119,74 @@ static void show_stat_irqs(struct seq_file *p)
 	int i;
 
 	seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
+
+	if (IS_ENABLED(CONFIG_SMP) && (nr_cpu_ids >= 10) && (nr_irqs >= 256)) {
+		/*
+		 * On systems with 10 or more CPUs and 256 or more IRQs,
+		 * we used a bitmap to keep track of the number of active
+		 * IRQs and call kstat_irqs_usr() only for those IRQs.
+		 * The bitmap will be refreshed whenever nr_active_irqs
+		 * changes.
+		 */
+		extern atomic_t nr_active_irqs;
+		static DEFINE_MUTEX(irqs_mutex);
+		static int last_irq = -1;
+		static int bitmap_size, active_irqs;
+		static unsigned long *bitmap;
+		int current_irqs = atomic_read(&nr_active_irqs);
+
+		mutex_lock(&irqs_mutex);
+		if (current_irqs != active_irqs) {
+			/*
+			 * Rescan all the IRQs for active ones.
+			 */
+			if (nr_irqs > bitmap_size) {
+				static unsigned long *new_bitmap;
+				static int new_size;
+
+				new_size = BITS_TO_LONGS(nr_irqs)*sizeof(long);
+				new_bitmap = (unsigned long *)krealloc(bitmap,
+						new_size, GFP_KERNEL);
+				if (!new_bitmap)
+					goto fallback;
+				bitmap = new_bitmap;
+				bitmap_size = new_size;
+			}
+			memset(bitmap, 0, bitmap_size/BITS_PER_BYTE);
+			last_irq = 0;
+			for_each_irq_nr(i) {
+				int cnt = kstat_irqs_usr(i);
+
+				if (cnt) {
+					bitmap_set(bitmap, 0, i);
+					last_irq = i;
+				}
+				seq_put_decimal_ull(p, " ", cnt);
+			}
+			active_irqs = current_irqs;
+			mutex_unlock(&irqs_mutex);
+			goto out;
+		}
+		/*
+		 * Retrieve counts from active IRQs only.
+		 */
+		for (i = 0; i <= last_irq; i++) {
+			int next = find_next_bit(bitmap, last_irq + 1, i);
+
+			if (next > i)
+				write_zeros(p, next - i);
+			i = next;
+			seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
+		}
+		mutex_unlock(&irqs_mutex);
+		write_zeros(p, nr_irqs - i);
+		goto out;
+	}
+fallback:
 	for_each_irq_nr(i)
 		seq_put_decimal_ull(p, " ", kstat_irqs_usr(i));
 
+out:
 	seq_putc(p, '\n');
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 4/4] /proc/stat: Call kstat_irqs_usr() only for active IRQs
  2019-01-09 17:23 ` [PATCH v2 4/4] /proc/stat: Call kstat_irqs_usr() only for " Waiman Long
@ 2019-01-09 17:39   ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 17:39 UTC (permalink / raw)
  To: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner
  Cc: linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap, Matthew Wilcox

On 01/09/2019 12:23 PM, Waiman Long wrote:
> After skipping the percpu summation of non-active IRQs on a 4-socket
> Broadwell system with about 3k IRQs, about half of the CPU cycles were
> spent in the kstat_irqs() call. The majority of which were used to look
> up the IRQ descriptors for the corresponding IRQ numbers.
>
> We can recoup a lot of those lost cycles by calling kstat_irqs_usr()
> only for those IRQs that are active. A bitmap is now used to keep track
> of the list of the active IRQs. Changes in nr_active_irqs count will
> cause the code to rescan all the IRQs and repopulate the bitmap.
>
> On the same 4-socket server, the introduction of this patch further
> reduces the system time of reading /proc/stat 5k times from 8.048s
> to 5.817s. This is a another time reduction of 28%.
>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  fs/proc/stat.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 84 insertions(+)
>
> diff --git a/fs/proc/stat.c b/fs/proc/stat.c
> index 4b06f1b..5e2a398 100644
> --- a/fs/proc/stat.c
> +++ b/fs/proc/stat.c
> @@ -93,6 +93,25 @@ static u64 compute_stat_irqs_sum(void)
>  }
>  
>  /*
> + * Write the given number of space separated '0' into the sequence file.
> + */
> +static void write_zeros(struct seq_file *p, int cnt)
> +{
> +	/* String of 16 '0's */
> +	static const char zeros[] = " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0";
> +
> +	while (cnt > 0) {
> +		if (cnt >= 16) {
> +			seq_write(p, zeros, 32);
> +			cnt -= 16;
> +		} else {
> +			seq_write(p, zeros, 2 * cnt);
> +			cnt = 0;
> +		}
> +	}
> +}
> +
> +/*
>   * Print out the "intr" line of /proc/stat.
>   */
>  static void show_stat_irqs(struct seq_file *p)
> @@ -100,9 +119,74 @@ static void show_stat_irqs(struct seq_file *p)
>  	int i;
>  
>  	seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
> +
> +	if (IS_ENABLED(CONFIG_SMP) && (nr_cpu_ids >= 10) && (nr_irqs >= 256)) {
> +		/*
> +		 * On systems with 10 or more CPUs and 256 or more IRQs,
> +		 * we used a bitmap to keep track of the number of active
> +		 * IRQs and call kstat_irqs_usr() only for those IRQs.
> +		 * The bitmap will be refreshed whenever nr_active_irqs
> +		 * changes.
> +		 */
> +		extern atomic_t nr_active_irqs;
> +		static DEFINE_MUTEX(irqs_mutex);
> +		static int last_irq = -1;
> +		static int bitmap_size, active_irqs;
> +		static unsigned long *bitmap;
> +		int current_irqs = atomic_read(&nr_active_irqs);
> +
> +		mutex_lock(&irqs_mutex);
> +		if (current_irqs != active_irqs) {
> +			/*
> +			 * Rescan all the IRQs for active ones.
> +			 */
> +			if (nr_irqs > bitmap_size) {
> +				static unsigned long *new_bitmap;
> +				static int new_size;
> +
> +				new_size = BITS_TO_LONGS(nr_irqs)*sizeof(long);
> +				new_bitmap = (unsigned long *)krealloc(bitmap,
> +						new_size, GFP_KERNEL);
> +				if (!new_bitmap)
> +					goto fallback;
> +				bitmap = new_bitmap;
> +				bitmap_size = new_size;
> +			}
> +			memset(bitmap, 0, bitmap_size/BITS_PER_BYTE);
> +			last_irq = 0;

Sorry, last_irq should be initialized to -1 here.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 17:23 [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Waiman Long
                   ` (3 preceding siblings ...)
  2019-01-09 17:23 ` [PATCH v2 4/4] /proc/stat: Call kstat_irqs_usr() only for " Waiman Long
@ 2019-01-09 17:44 ` Matthew Wilcox
  2019-01-09 18:03   ` Waiman Long
  4 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-01-09 17:44 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On Wed, Jan 09, 2019 at 12:23:44PM -0500, Waiman Long wrote:
> This v2 patch optimizes the way the IRQ counts are retrieved and computed
> and getting rid of the sysctl parameter altogether to achieve a performance
> gain that is close to the v1 patch. This is based on the idea that while
> many IRQs can be supported by a system, only a handful of them are actually
> being used in most cases. We can save a lot of time by focusing on those
> active IRQs only and ignore the rests.

So your reaction to being told "Make this the same as every other thing
we have to sum across all CPUs" is to make it _even more different_ and
special-cased?  I'm done.  NAK.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 17:44 ` [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead Matthew Wilcox
@ 2019-01-09 18:03   ` Waiman Long
  2019-01-09 18:24     ` Matthew Wilcox
  0 siblings, 1 reply; 15+ messages in thread
From: Waiman Long @ 2019-01-09 18:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On 01/09/2019 12:44 PM, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 12:23:44PM -0500, Waiman Long wrote:
>> This v2 patch optimizes the way the IRQ counts are retrieved and computed
>> and getting rid of the sysctl parameter altogether to achieve a performance
>> gain that is close to the v1 patch. This is based on the idea that while
>> many IRQs can be supported by a system, only a handful of them are actually
>> being used in most cases. We can save a lot of time by focusing on those
>> active IRQs only and ignore the rests.
> So your reaction to being told "Make this the same as every other thing
> we have to sum across all CPUs" is to make it _even more different_ and
> special-cased?  I'm done.  NAK.

The paragraph above may be a bit misleading. This v2 patch actually
touches very little of the percpu accounting aspect of the IRQ counts.
See patches 2 and 3 for the relevant changes, which are just a few
lines of new code. Please review the individual patches before NAK'ing.

I could theoretically generalize them into a new set of percpu counting
helpers, but the idea behind them is quite different from the use cases
of percpu_counter, so it may not be a good idea to add them there.
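
A hypothetical shape for such helpers (the names and exact API are
invented here purely for illustration), which shows how they would
differ from percpu_counter:

/* A plain percpu counter plus an "ever used" indication, with no
 * central spinlock and no batching in the increment path. */
struct light_percpu_count {
	unsigned int __percpu	*count;
	atomic_t		users;	/* #CPUs that ever incremented */
};

static inline void light_percpu_count_inc(struct light_percpu_count *c)
{
	if (unlikely(__this_cpu_inc_return(*c->count) == 1))
		atomic_inc(&c->users);
}

static inline unsigned int light_percpu_count_sum(struct light_percpu_count *c)
{
	unsigned int sum = 0;
	int cpu;

	if (!atomic_read(&c->users))
		return 0;
	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(c->count, cpu);
	return sum;
}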

Cheers,
Longman



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 18:03   ` Waiman Long
@ 2019-01-09 18:24     ` Matthew Wilcox
  2019-01-09 18:37       ` Waiman Long
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-01-09 18:24 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On Wed, Jan 09, 2019 at 01:03:33PM -0500, Waiman Long wrote:
> The paragraph above may be a bit misleading. This v2 patch actually
> touches very little on percpu accounting aspect of the IRQ counts. See
> patches 2 and 3 for the relevant changes which is just a few line of new
> codes. Please review the individual patches before Nak'ing.
> 
> I could theoretically generalize them into a new set of percpu counting
> helpers, but the idea behind it is quite different from the use cases of
> percpu counter. So it may not be a good idea of adding it to there.

Did you even try just using the general purpose infrastructure that's
in place?  If that shows a performance problem _then_ it's time to make
this special snowflake just a little more special.  Not before.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 18:24     ` Matthew Wilcox
@ 2019-01-09 18:37       ` Waiman Long
  2019-01-09 18:52         ` Matthew Wilcox
  2019-01-09 18:54         ` Waiman Long
  0 siblings, 2 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 18:37 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On 01/09/2019 01:24 PM, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 01:03:33PM -0500, Waiman Long wrote:
>> The paragraph above may be a bit misleading. This v2 patch actually
>> touches very little on percpu accounting aspect of the IRQ counts. See
>> patches 2 and 3 for the relevant changes which is just a few line of new
>> codes. Please review the individual patches before Nak'ing.
>>
>> I could theoretically generalize them into a new set of percpu counting
>> helpers, but the idea behind it is quite different from the use cases of
>> percpu counter. So it may not be a good idea of adding it to there.
> Did you even try just using the general purpose infrastructure that's
> in place?  If that shows a performance problem _then_ it's time to make
> this special snowflake just a little more special.  Not before.

I have looked into the percpu_counter code. There are two aspects of it
that I don't want to introduce into the interrupt handler's code path
for updating the counts.

1) There is a raw spinlock in the percpu_counter structure that may need
to be acquired in the update path. This can be a performance drag
especially if lockdep is enabled.

2) The percpu_counter structure is 40 bytes in size on 64-bit systems
compared with just 8 bytes for the percpu count pointer and an
additional 4 bytes that I introduced in patch 2. With thousands of irq
descriptors, it can consume quite a lot more memory. Memory consumption
was a point that you brought up in one of your previous mails.
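
For reference, struct percpu_counter looks roughly like this (the exact
definition lives in include/linux/percpu_counter.h and varies with
kernel version and config options):

struct percpu_counter {
	raw_spinlock_t		lock;
	s64			count;
#ifdef CONFIG_HOTPLUG_CPU
	struct list_head	list;	/* all counters, for hotplug */
#endif
	s32 __percpu		*counters;
};

With the lock, the 64-bit aggregate, the list head and the percpu
pointer, that comes to roughly 40 bytes per counter on 64-bit systems.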

Cheers,
Longman



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 18:37       ` Waiman Long
@ 2019-01-09 18:52         ` Matthew Wilcox
  2019-01-09 18:57           ` Waiman Long
  2019-01-09 18:54         ` Waiman Long
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-01-09 18:52 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On Wed, Jan 09, 2019 at 01:37:23PM -0500, Waiman Long wrote:
> On 01/09/2019 01:24 PM, Matthew Wilcox wrote:
> > Did you even try just using the general purpose infrastructure that's
> > in place?  If that shows a performance problem _then_ it's time to make
> > this special snowflake just a little more special.  Not before.
> 
> I have looked into the percpu counter code. There are two aspects that I
> don't like to introduce to the interrupt handler's code path for
> updating the counts.
> 
> 1) There is a raw spinlock in the percpu_counter structure that may need
> to be acquired in the update path. This can be a performance drag
> especially if lockdep is enabled.
> 
> 2) The percpu_counter structure is 40 bytes in size on 64-bit systems
> compared with just 8 bytes for the percpu count pointer and an
> additional 4 bytes that I introduced in patch 2. With thousands of irq
> descriptors, it can consume quite a lot more memory. Memory consumption
> was a point that you brought up in one of your previous mails.

Then _argue that_.  Don't just go off and do something random without
explaining to the rest of us why we're wrong.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 18:37       ` Waiman Long
  2019-01-09 18:52         ` Matthew Wilcox
@ 2019-01-09 18:54         ` Waiman Long
  2019-01-09 19:59           ` Matthew Wilcox
  1 sibling, 1 reply; 15+ messages in thread
From: Waiman Long @ 2019-01-09 18:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On 01/09/2019 01:37 PM, Waiman Long wrote:
> On 01/09/2019 01:24 PM, Matthew Wilcox wrote:
>> On Wed, Jan 09, 2019 at 01:03:33PM -0500, Waiman Long wrote:
>>> The paragraph above may be a bit misleading. This v2 patch actually
>>> touches very little on percpu accounting aspect of the IRQ counts. See
>>> patches 2 and 3 for the relevant changes which is just a few line of new
>>> codes. Please review the individual patches before Nak'ing.
>>>
>>> I could theoretically generalize them into a new set of percpu counting
>>> helpers, but the idea behind it is quite different from the use cases of
>>> percpu counter. So it may not be a good idea of adding it to there.
>> Did you even try just using the general purpose infrastructure that's
>> in place?  If that shows a performance problem _then_ it's time to make
>> this special snowflake just a little more special.  Not before.
> I have looked into the percpu counter code. There are two aspects that I
> don't like to introduce to the interrupt handler's code path for
> updating the counts.
>
> 1) There is a raw spinlock in the percpu_counter structure that may need
> to be acquired in the update path. This can be a performance drag
> especially if lockdep is enabled.
>
> 2) The percpu_counter structure is 40 bytes in size on 64-bit systems
> compared with just 8 bytes for the percpu count pointer and an
> additional 4 bytes that I introduced in patch 2. With thousands of irq
> descriptors, it can consume quite a lot more memory. Memory consumption
> was a point that you brought up in one of your previous mails.

If you read patch 4, you can see that quite a few CPU cycles were
spent looking up the radix tree to locate the IRQ descriptor for each of
the interrupts. That overhead will still be there even if I use percpu
counters, so using percpu counters alone won't be as performant as this
patch or my previous v1 patch.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 18:52         ` Matthew Wilcox
@ 2019-01-09 18:57           ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 18:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On 01/09/2019 01:52 PM, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 01:37:23PM -0500, Waiman Long wrote:
>> On 01/09/2019 01:24 PM, Matthew Wilcox wrote:
>>> Did you even try just using the general purpose infrastructure that's
>>> in place?  If that shows a performance problem _then_ it's time to make
>>> this special snowflake just a little more special.  Not before.
>> I have looked into the percpu counter code. There are two aspects that I
>> don't like to introduce to the interrupt handler's code path for
>> updating the counts.
>>
>> 1) There is a raw spinlock in the percpu_counter structure that may need
>> to be acquired in the update path. This can be a performance drag
>> especially if lockdep is enabled.
>>
>> 2) The percpu_counter structure is 40 bytes in size on 64-bit systems
>> compared with just 8 bytes for the percpu count pointer and an
>> additional 4 bytes that I introduced in patch 2. With thousands of irq
>> descriptors, it can consume quite a lot more memory. Memory consumption
>> was a point that you brought up in one of your previous mails.
> Then _argue that_.  Don't just go off and do something random without
> explaining to the rest of us why we're wrong.

Sorry about that. I should have included that in the cover letter.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 18:54         ` Waiman Long
@ 2019-01-09 19:59           ` Matthew Wilcox
  2019-01-09 20:14             ` Waiman Long
  0 siblings, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2019-01-09 19:59 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On Wed, Jan 09, 2019 at 01:54:36PM -0500, Waiman Long wrote:
> If you read patch 4, you can see that quite a bit of CPU cycles was
> spent looking up the radix tree to locate the IRQ descriptor for each of
> the interrupts. Those overhead will still be there even if I use percpu
> counters. So using percpu counter alone won't be as performant as this
> patch or my previous v1 patch.

Hm, if that's the overhead, then the radix tree (and the XArray) have
APIs that can reduce that overhead.  Right now, there's only one caller
of kstat_irqs_usr() (the proc code).  If we change that to fill an array
instead of returning a single value, it can look something like this:

void kstat_irqs_usr(unsigned int *sums)
{
	XA_STATE(xas, &irq_descs, 0);
	struct irq_desc *desc;
	int cpu;

	xas_for_each(&xas, desc, ULONG_MAX) {
		unsigned int sum = 0;

		if (!desc->kstat_irqs)
			continue;
		for_each_possible_cpu(cpu)
			sum += *per_cpu_ptr(desc->kstat_irqs, cpu);

		sums[xas.xa_index] = sum;
	}
}
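
A sketch of how the /proc/stat side could then consume such an array
(a hypothetical adaptation; the allocation, error handling and the int
return type are assumptions, not part of the suggestion above):

static int show_stat_irqs(struct seq_file *p)
{
	unsigned int *sums = kcalloc(nr_irqs, sizeof(*sums), GFP_KERNEL);
	int i;

	if (!sums)
		return -ENOMEM;

	kstat_irqs_usr(sums);	/* one pass over the descriptor tree */
	seq_put_decimal_ull(p, "intr ", compute_stat_irqs_sum());
	for (i = 0; i < nr_irqs; i++)
		seq_put_decimal_ull(p, " ", sums[i]);
	seq_putc(p, '\n');
	kfree(sums);
	return 0;
}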

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 0/4] /proc/stat: Reduce irqs counting performance overhead
  2019-01-09 19:59           ` Matthew Wilcox
@ 2019-01-09 20:14             ` Waiman Long
  0 siblings, 0 replies; 15+ messages in thread
From: Waiman Long @ 2019-01-09 20:14 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Alexey Dobriyan, Kees Cook, Thomas Gleixner,
	linux-kernel, linux-fsdevel, Davidlohr Bueso, Miklos Szeredi,
	Daniel Colascione, Dave Chinner, Randy Dunlap

On 01/09/2019 02:59 PM, Matthew Wilcox wrote:
> On Wed, Jan 09, 2019 at 01:54:36PM -0500, Waiman Long wrote:
>> If you read patch 4, you can see that quite a bit of CPU cycles was
>> spent looking up the radix tree to locate the IRQ descriptor for each of
>> the interrupts. Those overhead will still be there even if I use percpu
>> counters. So using percpu counter alone won't be as performant as this
>> patch or my previous v1 patch.
> Hm, if that's the overhead, then the radix tree (and the XArray) have
> APIs that can reduce that overhead.  Right now, there's only one caller
> of kstat_irqs_usr() (the proc code).  If we change that to fill an array
> instead of returning a single value, it can look something like this:
>
> void kstat_irqs_usr(unsigned int *sums)
> {
> 	XA_STATE(xas, &irq_descs, 0);
> 	struct irq_desc *desc;
> 	int cpu;
>
> 	xas_for_each(&xas, desc, ULONG_MAX) {
> 		unsigned int sum = 0;
>
> 		if (!desc->kstat_irqs)
> 			continue;
> 		for_each_possible_cpu(cpu)
> 			sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
>
> 		sums[xas.xa_index] = sum;
> 	}
> }

OK, I will try something like that as a replacement for patch 4 to see
how it compares with my current patch.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 15+ messages in thread

