From: Frederic Weisbecker <fweisbec@gmail.com>
To: Christoph Lameter <cl@gentwo.org>
Cc: akpm@linux-foundation.org, Gilad Ben-Yossef <gilad@benyossef.com>,
	Thomas Gleixner <tglx@linutronix.de>, Tejun Heo <tj@kernel.org>,
	John Stultz <johnstul@us.ibm.com>,
	Mike Frysinger <vapier@gentoo.org>,
	Minchan Kim <minchan.kim@gmail.com>,
	Hakan Akkan <hakanakkan@gmail.com>,
	Max Krasnyansky <maxk@qualcomm.com>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hughd@google.com, viresh.kumar@linaro.org, hpa@zytor.com,
	mingo@kernel.org, peterz@infradead.org
Subject: Re: vmstat: On demand vmstat workers V8
Date: Fri, 11 Jul 2014 15:20:34 +0200
Message-ID: <20140711132032.GB26045@localhost.localdomain>
In-Reply-To: <alpine.DEB.2.11.1407100903130.12483@gentwo.org>

On Thu, Jul 10, 2014 at 09:04:55AM -0500, Christoph Lameter wrote:
> 
> V7->V8
> - hackbench regression test shows a tiny performance increase due
>   to reduced OS processing.
> - Rediff against 3.16-rc4.
> 
> V6->V7
> - Remove sysfs support.
> 
> V5->V6:
> - Shepherd thread as a general worker thread. This means
>   that Frederic Weisbecker's general mechanism for controlling
>   worker thread cpu use is necessary to restrict the shepherd
>   thread to the cpus not used for low latency tasks. Hopefully
>   that is ready to be merged soon. There is no need anymore to
>   have a specific cpu be the housekeeper cpu.
> 
> V4->V5:
> - Shepherd thread on a specific cpu (HOUSEKEEPING_CPU).
> - Incorporate Andrew's feedback.
> - Work out the races.
> - Make visible which CPUs have stat updates switched off
>   in /sys/devices/system/cpu/stat_off
> 
> V3->V4:
> - Make the shepherd task not deferrable. It runs on the tick cpu
>   anyway. Deferral could get deltas too far out of sync if
>   vmstat operations are disabled for a certain processor.
> 
> V2->V3:
> - Introduce a new tick_get_housekeeping_cpu() function. Not sure
>   if that is exactly what we want but it is a start. Thomas?
> - Migrate the shepherd task if the output of
>   tick_get_housekeeping_cpu() changes.
> - Fixes recommended by Andrew.
> 
> V1->V2:
> - Optimize the need_update check by using memchr_inv (sketched below).
> - Clean up.
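
A minimal sketch of the memchr_inv-based need_update check mentioned in
the V1->V2 notes (an assumed, simplified shape, not the exact patch code;
field names follow struct per_cpu_pageset):

	static bool need_update(int cpu)
	{
		struct zone *zone;

		for_each_populated_zone(zone) {
			struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);

			/* Any nonzero byte in the differential array means
			 * this cpu has pending counter updates. */
			if (memchr_inv(p->vm_stat_diff, 0,
				       NR_VM_ZONE_STAT_ITEMS * sizeof(p->vm_stat_diff[0])))
				return true;
		}
		return false;
	}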
> 
> vmstat workers are used for folding counter differentials into the
> zone, per-node and global counters at certain time intervals.
> They currently run at defined intervals on all processors, which
> causes some holdoff for processors that need minimal intrusion by
> the OS.
> 
> The current vmstat_update mechanism depends on a deferrable timer
> firing every other second by default, which registers a work queue
> item that runs on the local CPU. The result is one interrupt and one
> additional schedulable task on each CPU every 2 seconds. If a
> workload indeed causes VM activity or multiple tasks are running
> on a CPU, then there are probably bigger issues to deal with.
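
A minimal sketch of that pre-patch mechanism (a simplification, not the
literal kernel source): each CPU's deferrable delayed work folds its
counters and then re-arms itself.

	static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

	static void vmstat_update(struct work_struct *w)
	{
		refresh_cpu_vm_stats();
		/* Re-arm on the local CPU; the work is deferrable, so an
		 * idle CPU is not woken just for this. */
		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
			round_jiffies_relative(sysctl_stat_interval));
	}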
> 
> However, some workloads dedicate a CPU to a single CPU-bound task.
> This is done in high performance computing, in high frequency
> financial applications and in networking (Intel DPDK, EZchip NPS).
> With the advent of systems with more and more CPUs over time, this
> may become more and more common to do, since when one has enough
> CPUs one cares less about efficiently sharing a CPU with other tasks
> and more about efficiently monopolizing a CPU per task.
> 
> The difference made by this timer firing and the workqueue kernel
> thread being scheduled can be enormous. An artificial test measured
> the worst case time to do a simple "i++" in an endless loop on a
> bare metal system and under Linux on an isolated CPU with dynticks,
> with and without this patch: Linux matches the bare metal
> performance (~700 cycles) with this patch and loses by a couple of
> orders of magnitude (~200k cycles) without it[*]. The loss occurs
> for something that just calculates statistics. For networking
> applications, for example, this could be the difference between
> dropping packets and sustaining line rate.
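
A hypothetical reconstruction of such a worst-case measurement (the exact
test is not shown in the thread); it records the largest gap between
consecutive TSC reads inside the "i++" loop:

	#include <stdint.h>
	#include <stdio.h>
	#include <x86intrin.h>		/* __rdtsc() */

	int main(void)
	{
		uint64_t prev = __rdtsc(), worst = 0;
		volatile uint64_t i;

		for (i = 0; i < (1ULL << 30); i++) {	/* the "i++" loop */
			uint64_t now = __rdtsc();
			if (now - prev > worst)
				worst = now - prev;	/* longest holdoff seen */
			prev = now;
		}
		printf("worst gap: %llu cycles\n",
		       (unsigned long long)worst);
		return 0;
	}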
> 
> Statistics are important and useful, but it would be great if
> gathering them did not produce a huge performance difference.
> This patch does just that.
> 
> This patch creates a vmstat shepherd worker that monitors the
> per-cpu differentials on all processors. If there are differentials
> on a processor, then a vmstat worker local to that processor is
> created. That worker will then fold the diffs at regular intervals.
> Should the worker find that there is no work to be done, it will
> make the shepherd worker monitor the differentials again.
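
A hedged sketch of the shepherd idea (names such as shepherd, need_update
and cpu_stat_off follow the patch below; the body is a simplification,
not the exact implementation):

	static struct delayed_work shepherd;

	static void vmstat_shepherd(struct work_struct *w)
	{
		int cpu;

		/* Scan the CPUs whose local vmstat workers are off. */
		for_each_cpu(cpu, cpu_stat_off)
			if (need_update(cpu))
				/* Differentials found: restart the local
				 * vmstat worker there (the full patch also
				 * clears the cpu from cpu_stat_off). */
				schedule_delayed_work_on(cpu,
					&per_cpu(vmstat_work, cpu), 0);

		/* Keep monitoring. */
		schedule_delayed_work(&shepherd,
			round_jiffies_relative(sysctl_stat_interval));
	}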
> 
> With this patch it is then possible to have periods longer than
> 2 seconds without any OS event on a "cpu" (hardware thread).
> 
> The patch shows a very minor increase in system performance.
> 
> 
> hackbench -s 512 -l 2000 -g 15 -f 25 -P
> 
> Results before the patch:
> 
> Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
> Each sender will pass 2000 messages of 512 bytes
> Time: 4.992
> Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
> Each sender will pass 2000 messages of 512 bytes
> Time: 4.971
> Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
> Each sender will pass 2000 messages of 512 bytes
> Time: 5.063
> 
> Results after the patch:
> 
> Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
> Each sender will pass 2000 messages of 512 bytes
> Time: 4.973
> Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
> Each sender will pass 2000 messages of 512 bytes
> Time: 4.990
> Running in process mode with 15 groups using 50 file descriptors each (== 750 tasks)
> Each sender will pass 2000 messages of 512 bytes
> Time: 4.993
> 
> 
> 
> Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> 
> Index: linux/mm/vmstat.c
> ===================================================================
> --- linux.orig/mm/vmstat.c	2014-07-07 10:15:01.790099463 -0500
> +++ linux/mm/vmstat.c	2014-07-07 10:17:17.397891143 -0500
> @@ -7,6 +7,7 @@
>   *  zoned VM statistics
>   *  Copyright (C) 2006 Silicon Graphics, Inc.,
>   *		Christoph Lameter <christoph@lameter.com>
> + *  Copyright (C) 2008-2014 Christoph Lameter
>   */
>  #include <linux/fs.h>
>  #include <linux/mm.h>
> @@ -14,6 +15,7 @@
>  #include <linux/module.h>
>  #include <linux/slab.h>
>  #include <linux/cpu.h>
> +#include <linux/cpumask.h>
>  #include <linux/vmstat.h>
>  #include <linux/sched.h>
>  #include <linux/math64.h>
> @@ -419,13 +421,22 @@ void dec_zone_page_state(struct page *pa
>  EXPORT_SYMBOL(dec_zone_page_state);
>  #endif
> 
> -static inline void fold_diff(int *diff)
> +
> +/*
> + * Fold a differential into the global counters.
> + * Returns the number of counters updated.
> + */
> +static int fold_diff(int *diff)
>  {
>  	int i;
> +	int changes = 0;
> 
>  	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> -		if (diff[i])
> +		if (diff[i]) {
>  			atomic_long_add(diff[i], &vm_stat[i]);
> +			changes++;
> +		}
> +	return changes;
>  }
> 
>  /*
> @@ -441,12 +452,15 @@ static inline void fold_diff(int *diff)
>   * statistics in the remote zone struct as well as the global cachelines
>   * with the global counters. These could cause remote node cache line
>   * bouncing and will have to be only done when necessary.
> + *
> + * The function returns the number of global counters updated.
>   */
> -static void refresh_cpu_vm_stats(void)
> +static int refresh_cpu_vm_stats(void)
>  {
>  	struct zone *zone;
>  	int i;
>  	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
> +	int changes = 0;
> 
>  	for_each_populated_zone(zone) {
>  		struct per_cpu_pageset __percpu *p = zone->pageset;
> @@ -486,15 +500,17 @@ static void refresh_cpu_vm_stats(void)
>  			continue;
>  		}
> 
> -
>  		if (__this_cpu_dec_return(p->expire))
>  			continue;
> 
> -		if (__this_cpu_read(p->pcp.count))
> +		if (__this_cpu_read(p->pcp.count)) {
>  			drain_zone_pages(zone, this_cpu_ptr(&p->pcp));
> +			changes++;
> +		}
>  #endif
>  	}
> -	fold_diff(global_diff);
> +	changes += fold_diff(global_diff);
> +	return changes;
>  }
> 
>  /*
> @@ -1228,20 +1244,105 @@ static const struct file_operations proc
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
>  int sysctl_stat_interval __read_mostly = HZ;
> +struct cpumask *cpu_stat_off;

I thought you converted it?
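
The question presumably refers to converting the bare pointer into a
statically allocated mask (an assumption; the thread does not show the
final form), roughly:

	/* Assumed alternative to "struct cpumask *cpu_stat_off": */
	static struct cpumask cpu_stat_off;

	/* ...manipulated through the cpumask API, e.g.: */
	cpumask_set_cpu(smp_processor_id(), &cpu_stat_off);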



