* vmstat: On demand vmstat workers V5
@ 2014-05-12 18:18 Christoph Lameter
  2014-05-13 15:24 ` Thomas Gleixner
  2014-05-28 15:21 ` Frederic Weisbecker
  0 siblings, 2 replies; 13+ messages in thread
From: Christoph Lameter @ 2014-05-12 18:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar, hpa, mingo, peterz


V4->V5:
- Shepherd thread on a specific cpu (HOUSEKEEPING_CPU).
- Incorporate Andrew's feedback
- Work out the races.
- Make visible which CPUs have stat updates switched off
  in /sys/devices/system/cpu/stat_off

V3->V4:
- Make the shepherd task not deferrable. It runs on the tick cpu
  anyways. Deferral could get deltas too far out of sync if
  vmstat operations are disabled for a certain processor.

V2->V3:
- Introduce a new tick_get_housekeeping_cpu() function. Not sure
  if that is exactly what we want but it is a start. Thomas?
- Migrate the shepherd task if the output of
  tick_get_housekeeping_cpu() changes.
- Fixes recommended by Andrew.

V1->V2:
- Optimize the need_update check by using memchr_inv (illustrated below).
- Clean up.
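
As an aside on the memchr_inv optimization: memchr_inv(p, c, n) returns
the address of the first of the n bytes at p that differs from c, or
NULL if all of them equal c, so a single call checks the whole
byte-sized vm_stat_diff array. A rough equivalent, for illustration
only:

	static void *memchr_inv_equiv(const void *p, int c, size_t n)
	{
		const unsigned char *s = p;
		size_t i;

		for (i = 0; i < n; i++)
			if (s[i] != (unsigned char)c)
				return (void *)&s[i];
		return NULL;
	}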

vmstat workers are used for folding counter differentials into the
zone, per node and global counters at certain time intervals.
They currently run at defined intervals on all processors, which causes
some holdoff for processors that need minimal intrusion by the
OS.

The current vmstat_update mechanism depends on a deferrable timer
firing every other second by default which registers a work queue item
that runs on the local CPU, with the result that we have 1 interrupt
and one additional schedulable task on each CPU every 2 seconds.
If a workload indeed causes VM activity or multiple tasks are running
on a CPU, then there are probably bigger issues to deal with.

However, some workloads dedicate a CPU for a single CPU bound task.
This is done in high performance computing, in high frequency
financial applications, in networking (Intel DPDK, EZchip NPS) and with
the advent of systems with more and more CPUs over time, this may become
more and more common to do: when one has enough CPUs,
one cares less about efficiently sharing a CPU with other tasks and
more about efficiently monopolizing a CPU per task.

The difference made by this timer firing and a workqueue kernel thread
being scheduled every second can be enormous. An artificial test measuring
the worst-case time to do a simple "i++" in an endless loop on a bare metal
system and under Linux on an isolated CPU with dynticks, with and
without this patch, shows Linux matching the bare metal performance
(~700 cycles) with this patch and losing by a couple of orders of magnitude
(~200k cycles) without it[*].  The loss occurs for something that just
calculates statistics. For networking applications, for example, this
could be the difference between dropping packets or sustaining line rate.
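
The exact benchmark was not posted; a minimal userspace sketch of this
kind of worst-case holdoff measurement (x86-only rdtsc, run pinned to
the isolated CPU, e.g. via taskset) might look like:

	#include <stdint.h>
	#include <stdio.h>

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;

		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((uint64_t)hi << 32) | lo;
	}

	int main(void)
	{
		uint64_t prev = rdtsc(), worst = 0;
		volatile long i = 0;	/* the "i++" being timed */

		for (;;) {
			uint64_t now = rdtsc();

			/* OS intrusion shows up as a gap between iterations */
			if (now - prev > worst) {
				worst = now - prev;
				printf("worst holdoff: %llu cycles\n",
				       (unsigned long long)worst);
			}
			prev = now;
			i++;
		}
	}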

Statistics are important and useful, but it would be great if there
were a way to keep statistics gathering from producing a huge
performance difference. This patch does just that.

This patch creates a vmstat shepherd worker that monitors the
per cpu differentials on all processors. If there are differentials
on a processor then a vmstat worker local to the processor
with the differentials is created. That worker will then start
folding the diffs at regular intervals. Should the worker
find that there is no work to be done then it will make the shepherd
worker monitor the differentials again.

With this patch it is then possible to have periods longer than
2 seconds without any OS event on a "cpu" (hardware thread).
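
For illustration, the new sysfs file then shows which CPUs currently
have their vmstat worker switched off (hypothetical output on an 8-cpu
machine where cpus 1-7 are quiescent):

	$ cat /sys/devices/system/cpu/stat_off
	1-7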

Signed-off-by: Christoph Lameter <cl@linux.com>


Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c	2014-05-12 12:43:49.373311788 -0500
+++ linux/mm/vmstat.c	2014-05-12 13:07:12.406956661 -0500
@@ -7,6 +7,7 @@
  *  zoned VM statistics
  *  Copyright (C) 2006 Silicon Graphics, Inc.,
  *		Christoph Lameter <christoph@lameter.com>
+ *  Copyright (C) 2008-2014 Christoph Lameter
  */
 #include <linux/fs.h>
 #include <linux/mm.h>
@@ -14,12 +15,14 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
 #include <linux/writeback.h>
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
+#include <linux/tick.h>

 #include "internal.h"

@@ -417,13 +420,22 @@
 EXPORT_SYMBOL(dec_zone_page_state);
 #endif

-static inline void fold_diff(int *diff)
+
+/*
+ * Fold a differential into the global counters.
+ * Returns the number of counters updated.
+ */
+static int fold_diff(int *diff)
 {
 	int i;
+	int changes = 0;

 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i])
+		if (diff[i]) {
 			atomic_long_add(diff[i], &vm_stat[i]);
+			changes++;
+	}
+	return changes;
 }

 /*
@@ -439,12 +451,15 @@
  * statistics in the remote zone struct as well as the global cachelines
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
+ *
+ * The function returns the number of global counters updated.
  */
-static void refresh_cpu_vm_stats(void)
+static int refresh_cpu_vm_stats(void)
 {
 	struct zone *zone;
 	int i;
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int changes = 0;

 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -484,15 +499,17 @@
 			continue;
 		}

-
 		if (__this_cpu_dec_return(p->expire))
 			continue;

-		if (__this_cpu_read(p->pcp.count))
+		if (__this_cpu_read(p->pcp.count)) {
 			drain_zone_pages(zone, __this_cpu_ptr(&p->pcp));
+			changes++;
+		}
 #endif
 	}
-	fold_diff(global_diff);
+	changes += fold_diff(global_diff);
+	return changes;
 }

 /*
@@ -1222,20 +1239,119 @@
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
+static DECLARE_BITMAP(cpu_stat_off_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_stat_off = to_cpumask(cpu_stat_off_bits);
+EXPORT_SYMBOL(cpu_stat_off);
+
+/* We need to write to cpu_stat_off here */
+#define stat_off to_cpumask(cpu_stat_off_bits)

 static void vmstat_update(struct work_struct *w)
 {
+	if (refresh_cpu_vm_stats())
+		/*
+		 * Counters were updated so we expect more updates
+		 * to occur in the future. Keep on running the
+		 * update worker thread.
+		 */
+		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
+			round_jiffies_relative(sysctl_stat_interval));
+	else {
+		/*
+		 * We did not update any counters so the app may be in
+		 * a mode where it does not cause counter updates.
+		 * We may be uselessly running vmstat_update.
+		 * Defer the checking for differentials to the
+		 * shepherd thread on a different processor.
+		 */
+		int r;
+		/*
+		 * Housekeeping cpu does not race since it never
+		 * changes the bit if its zero
+		 */
+		r = cpumask_test_and_set_cpu(smp_processor_id(),
+			stat_off);
+		VM_BUG_ON(r);
+	}
+}
+
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
+
+		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 * This works because the diffs are byte sized items.
+		 */
+		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
+			return true;
+
+	}
+	return false;
+}
+
+
+/*
+ * Shepherd worker thread that updates the statistics for the
+ * processor the shepherd worker is running on and checks the
+ * differentials of other processors that have their worker
+ * threads for vm statistics updates disabled because of
+ * inactivity.
+ */
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
 	refresh_cpu_vm_stats();
-	schedule_delayed_work(&__get_cpu_var(vmstat_work),
-		round_jiffies_relative(sysctl_stat_interval));
+
+	/* Check processors whose vmstat worker threads have been disabled */
+	for_each_cpu(cpu, stat_off)
+		if (need_update(cpu) &&
+			cpumask_test_and_clear_cpu(cpu, stat_off)) {
+
+			struct delayed_work *work = &per_cpu(vmstat_work, cpu);
+
+			INIT_DEFERRABLE_WORK(work, vmstat_update);
+			schedule_delayed_work_on(cpu, work,
+				__round_jiffies_relative(sysctl_stat_interval,
+				cpu));
+		}
+
+	schedule_delayed_work(this_cpu_ptr(&vmstat_work),
+		__round_jiffies_relative(sysctl_stat_interval,
+		HOUSEKEEPING_CPU));
+
 }

-static void start_cpu_timer(int cpu)
+static void __init start_shepherd_timer(void)
 {
-	struct delayed_work *work = &per_cpu(vmstat_work, cpu);
+	struct delayed_work *d = per_cpu_ptr(&vmstat_work,
+					HOUSEKEEPING_CPU);
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
+			vmstat_update);
+
+	INIT_DELAYED_WORK(d, vmstat_shepherd);
+
+	cpumask_copy(stat_off, cpu_online_mask);

-	INIT_DEFERRABLE_WORK(work, vmstat_update);
-	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
+	/* Make sure that arch code hasn't done anything goofy */
+	VM_BUG_ON(!cpu_isset(HOUSEKEEPING_CPU, *stat_off));
+
+	cpumask_clear_cpu(HOUSEKEEPING_CPU, stat_off);
+	schedule_delayed_work_on(HOUSEKEEPING_CPU, d,
+		__round_jiffies_relative(sysctl_stat_interval,
+		HOUSEKEEPING_CPU));
 }

 static void vmstat_cpu_dead(int node)
@@ -1266,17 +1382,17 @@
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		refresh_zone_stat_thresholds();
-		start_cpu_timer(cpu);
 		node_set_state(cpu_to_node(cpu), N_CPU);
+		cpumask_set_cpu(cpu, stat_off);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
-		cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
-		per_cpu(vmstat_work, cpu).work.func = NULL;
+		if (!cpumask_test_and_clear_cpu(cpu, stat_off))
+			cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
 		break;
 	case CPU_DOWN_FAILED:
 	case CPU_DOWN_FAILED_FROZEN:
-		start_cpu_timer(cpu);
+		cpumask_set_cpu(cpu, stat_off);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
@@ -1296,15 +1412,10 @@
 static int __init setup_vmstat(void)
 {
 #ifdef CONFIG_SMP
-	int cpu;
-
 	cpu_notifier_register_begin();
 	__register_cpu_notifier(&vmstat_notifier);

-	for_each_online_cpu(cpu) {
-		start_cpu_timer(cpu);
-		node_set_state(cpu_to_node(cpu), N_CPU);
-	}
+	start_shepherd_timer();
 	cpu_notifier_register_done();
 #endif
 #ifdef CONFIG_PROC_FS
Index: linux/include/linux/tick.h
===================================================================
--- linux.orig/include/linux/tick.h	2014-05-12 12:43:49.373311788 -0500
+++ linux/include/linux/tick.h	2014-05-12 12:43:49.369311864 -0500
@@ -13,6 +13,8 @@
 #include <linux/context_tracking_state.h>
 #include <linux/cpumask.h>

+#define HOUSEKEEPING_CPU 0
+
 #ifdef CONFIG_GENERIC_CLOCKEVENTS

 enum tick_device_mode {
Index: linux/drivers/base/cpu.c
===================================================================
--- linux.orig/drivers/base/cpu.c	2014-05-12 12:43:49.373311788 -0500
+++ linux/drivers/base/cpu.c	2014-05-12 13:11:07.054562341 -0500
@@ -222,6 +222,7 @@
 	_CPU_ATTR(online, &cpu_online_mask),
 	_CPU_ATTR(possible, &cpu_possible_mask),
 	_CPU_ATTR(present, &cpu_present_mask),
+	_CPU_ATTR(stat_off, &cpu_stat_off),
 };

 /*
@@ -378,6 +379,7 @@
 	&cpu_attrs[0].attr.attr,
 	&cpu_attrs[1].attr.attr,
 	&cpu_attrs[2].attr.attr,
+	&cpu_attrs[3].attr.attr,
 	&dev_attr_kernel_max.attr,
 	&dev_attr_offline.attr,
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
Index: linux/include/linux/cpumask.h
===================================================================
--- linux.orig/include/linux/cpumask.h	2014-05-12 12:43:49.373311788 -0500
+++ linux/include/linux/cpumask.h	2014-05-12 12:52:12.191784100 -0500
@@ -80,6 +80,7 @@
 extern const struct cpumask *const cpu_online_mask;
 extern const struct cpumask *const cpu_present_mask;
 extern const struct cpumask *const cpu_active_mask;
+extern const struct cpumask *const cpu_stat_off;

 #if NR_CPUS > 1
 #define num_online_cpus()	cpumask_weight(cpu_online_mask)


* Re: vmstat: On demand vmstat workers V5
  2014-05-12 18:18 vmstat: On demand vmstat workers V5 Christoph Lameter
@ 2014-05-13 15:24 ` Thomas Gleixner
  2014-05-14 16:07   ` Christoph Lameter
  2014-05-28 15:21 ` Frederic Weisbecker
  1 sibling, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2014-05-13 15:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar, hpa, mingo, peterz

On Mon, 12 May 2014, Christoph Lameter wrote:

> 
> V4->V5:
> - Shepherd thread on a specific cpu (HOUSEKEEPING_CPU).

That's going to fail for the NOHZ_IDLE case, e.g. for the little/big
cluster support on ARM. They need to be able to move it from the
housekeeper on the big to the housekeeper on the little cluster.

So as I said before: This wants to be on a dedicated housekeeper
workqueue. Where the thread is placed is decided by the core depending
on system configuration and state.

	  nohz = off	    CPU0
	  nohz = idle	    core decision depending on state
	  nohz = full	    CPU0

Your solution solves only the off and full case and will fail when in
the idle case CPU0 should be sent into deep idle sleep until the user
decides to do some work which requires CPU0 to come back.

Thanks,

	tglx


* Re: vmstat: On demand vmstat workers V5
  2014-05-13 15:24 ` Thomas Gleixner
@ 2014-05-14 16:07   ` Christoph Lameter
  2014-05-14 23:15     ` Thomas Gleixner
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2014-05-14 16:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar, hpa, mingo, peterz

On Tue, 13 May 2014, Thomas Gleixner wrote:

> So as I said before: This wants to be on a dedicated housekeeper
> workqueue. Where the thread is placed is decided by the core depending
> on system configuration and state.

Ok then we can just go with a generic worker thread and use Frederic's
work on restricting general work queue items to specific cpus to
restrict where the work queue can be run.

Then this would be ok I guess?

Subject: vmstat: On demand vmstat workers V6


V5->V6:
- Shepherd thread as a general worker thread. This means
  that the general mechanism to control worker thread
  cpu placement proposed by Frederic Weisbecker is necessary to
  restrict the shepherd thread to the cpus not used
  for low latency tasks. Hopefully that is ready to be
  merged soon. No need anymore to have a specific
  cpu be the housekeeper cpu.

V4->V5:
- Shepherd thread on a specific cpu (HOUSEKEEPING_CPU).
- Incorporate Andrew's feedback
- Work out the races.
- Make visible which CPUs have stat updates switched off
  in /sys/devices/system/cpu/stat_off

V3->V4:
- Make the shepherd task not deferrable. It runs on the tick cpu
  anyways. Deferral could get deltas too far out of sync if
  vmstat operations are disabled for a certain processor.

V2->V3:
- Introduce a new tick_get_housekeeping_cpu() function. Not sure
  if that is exactly what we want but it is a start. Thomas?
- Migrate the shepherd task if the output of
  tick_get_housekeeping_cpu() changes.
- Fixes recommended by Andrew.

V1->V2:
- Optimize the need_update check by using memchr_inv.
- Clean up.

vmstat workers are used for folding counter differentials into the
zone, per node and global counters at certain time intervals.
They currently run at defined intervals on all processors, which causes
some holdoff for processors that need minimal intrusion by the
OS.

The current vmstat_update mechanism depends on a deferrable timer
firing every other second by default which registers a work queue item
that runs on the local CPU, with the result that we have 1 interrupt
and one additional schedulable task on each CPU every 2 seconds.
If a workload indeed causes VM activity or multiple tasks are running
on a CPU, then there are probably bigger issues to deal with.

However, some workloads dedicate a CPU for a single CPU bound task.
This is done in high performance computing, in high frequency
financial applications, in networking (Intel DPDK, EZchip NPS) and with
the advent of systems with more and more CPUs over time, this may become
more and more common to do: when one has enough CPUs,
one cares less about efficiently sharing a CPU with other tasks and
more about efficiently monopolizing a CPU per task.

The difference made by this timer firing and a workqueue kernel thread
being scheduled every second can be enormous. An artificial test measuring
the worst-case time to do a simple "i++" in an endless loop on a bare metal
system and under Linux on an isolated CPU with dynticks, with and
without this patch, shows Linux matching the bare metal performance
(~700 cycles) with this patch and losing by a couple of orders of magnitude
(~200k cycles) without it[*].  The loss occurs for something that just
calculates statistics. For networking applications, for example, this
could be the difference between dropping packets or sustaining line rate.

Statistics are important and useful, but it would be great if there
were a way to keep statistics gathering from producing a huge
performance difference. This patch does just that.

This patch creates a vmstat shepherd worker that monitors the
per cpu differentials on all processors. If there are differentials
on a processor then a vmstat worker local to the processor
with the differentials is created. That worker will then start
folding the diffs at regular intervals. Should the worker
find that there is no work to be done then it will make the shepherd
worker monitor the differentials again.

With this patch it is then possible to have periods longer than
2 seconds without any OS event on a "cpu" (hardware thread).

Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Christoph Lameter <cl@linux.com>


Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c	2014-05-12 12:43:49.373311788 -0500
+++ linux/mm/vmstat.c	2014-05-14 11:02:58.093511078 -0500
@@ -7,6 +7,7 @@
  *  zoned VM statistics
  *  Copyright (C) 2006 Silicon Graphics, Inc.,
  *		Christoph Lameter <christoph@lameter.com>
+ *  Copyright (C) 2008-2014 Christoph Lameter
  */
 #include <linux/fs.h>
 #include <linux/mm.h>
@@ -14,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
@@ -417,13 +419,22 @@
 EXPORT_SYMBOL(dec_zone_page_state);
 #endif

-static inline void fold_diff(int *diff)
+
+/*
+ * Fold a differential into the global counters.
+ * Returns the number of counters updated.
+ */
+static int fold_diff(int *diff)
 {
 	int i;
+	int changes = 0;

 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i])
+		if (diff[i]) {
 			atomic_long_add(diff[i], &vm_stat[i]);
+			changes++;
+	}
+	return changes;
 }

 /*
@@ -439,12 +450,15 @@
  * statistics in the remote zone struct as well as the global cachelines
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
+ *
+ * The function returns the number of global counters updated.
  */
-static void refresh_cpu_vm_stats(void)
+static int refresh_cpu_vm_stats(void)
 {
 	struct zone *zone;
 	int i;
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int changes = 0;

 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -484,15 +498,17 @@
 			continue;
 		}

-
 		if (__this_cpu_dec_return(p->expire))
 			continue;

-		if (__this_cpu_read(p->pcp.count))
+		if (__this_cpu_read(p->pcp.count)) {
 			drain_zone_pages(zone, __this_cpu_ptr(&p->pcp));
+			changes++;
+		}
 #endif
 	}
-	fold_diff(global_diff);
+	changes += fold_diff(global_diff);
+	return changes;
 }

 /*
@@ -1222,20 +1238,109 @@
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
+static DECLARE_BITMAP(cpu_stat_off_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_stat_off = to_cpumask(cpu_stat_off_bits);
+EXPORT_SYMBOL(cpu_stat_off);
+
+/* We need to write to cpu_stat_off here */
+#define stat_off to_cpumask(cpu_stat_off_bits)

 static void vmstat_update(struct work_struct *w)
 {
-	refresh_cpu_vm_stats();
-	schedule_delayed_work(&__get_cpu_var(vmstat_work),
+	if (refresh_cpu_vm_stats())
+		/*
+		 * Counters were updated so we expect more updates
+		 * to occur in the future. Keep on running the
+		 * update worker thread.
+		 */
+		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
+			round_jiffies_relative(sysctl_stat_interval));
+	else {
+		/*
+		 * We did not update any counters so the app may be in
+		 * a mode where it does not cause counter updates.
+		 * We may be uselessly running vmstat_update.
+		 * Defer the checking for differentials to the
+		 * shepherd thread on a different processor.
+		 */
+		int r;
+		/*
+		 * Shepherd work thread does not race since it never
+		 * changes the bit if it is zero but the cpu
+		 * online / off line code may race if
+		 * worker threads are still allowed during
+		 * shutdown / startup.
+		 */
+		r = cpumask_test_and_set_cpu(smp_processor_id(),
+			stat_off);
+		VM_BUG_ON(r);
+	}
+}
+
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
+
+		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 * This works because the diffs are byte sized items.
+		 */
+		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
+			return true;
+
+	}
+	return false;
+}
+
+
+/*
+ * Shepherd worker thread that checks the
+ * differentials of processors that have their worker
+ * threads for vm statistics updates disabled because of
+ * inactivity.
+ */
+static void vmstat_shepherd(struct work_struct *w);
+
+static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	/* Check processors whose vmstat worker threads have been disabled */
+	for_each_cpu(cpu, stat_off)
+		if (need_update(cpu) &&
+			cpumask_test_and_clear_cpu(cpu, stat_off))
+
+			schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu),
+				__round_jiffies_relative(sysctl_stat_interval, cpu));
+
+
+	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
+
 }

-static void start_cpu_timer(int cpu)
+static void __init start_shepherd_timer(void)
 {
-	struct delayed_work *work = &per_cpu(vmstat_work, cpu);
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
+			vmstat_update);
+
+	cpumask_copy(stat_off, cpu_online_mask);

-	INIT_DEFERRABLE_WORK(work, vmstat_update);
-	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
 }

 static void vmstat_cpu_dead(int node)
@@ -1266,17 +1371,17 @@
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		refresh_zone_stat_thresholds();
-		start_cpu_timer(cpu);
 		node_set_state(cpu_to_node(cpu), N_CPU);
+		cpumask_set_cpu(cpu, stat_off);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
-		cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
-		per_cpu(vmstat_work, cpu).work.func = NULL;
+		if (!cpumask_test_and_set_cpu(cpu, stat_off))
+			cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
 		break;
 	case CPU_DOWN_FAILED:
 	case CPU_DOWN_FAILED_FROZEN:
-		start_cpu_timer(cpu);
+		cpumask_set_cpu(cpu, stat_off);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
@@ -1296,15 +1401,10 @@
 static int __init setup_vmstat(void)
 {
 #ifdef CONFIG_SMP
-	int cpu;
-
 	cpu_notifier_register_begin();
 	__register_cpu_notifier(&vmstat_notifier);

-	for_each_online_cpu(cpu) {
-		start_cpu_timer(cpu);
-		node_set_state(cpu_to_node(cpu), N_CPU);
-	}
+	start_shepherd_timer();
 	cpu_notifier_register_done();
 #endif
 #ifdef CONFIG_PROC_FS
Index: linux/drivers/base/cpu.c
===================================================================
--- linux.orig/drivers/base/cpu.c	2014-05-12 12:43:49.373311788 -0500
+++ linux/drivers/base/cpu.c	2014-05-12 13:11:07.054562341 -0500
@@ -222,6 +222,7 @@
 	_CPU_ATTR(online, &cpu_online_mask),
 	_CPU_ATTR(possible, &cpu_possible_mask),
 	_CPU_ATTR(present, &cpu_present_mask),
+	_CPU_ATTR(stat_off, &cpu_stat_off),
 };

 /*
@@ -378,6 +379,7 @@
 	&cpu_attrs[0].attr.attr,
 	&cpu_attrs[1].attr.attr,
 	&cpu_attrs[2].attr.attr,
+	&cpu_attrs[3].attr.attr,
 	&dev_attr_kernel_max.attr,
 	&dev_attr_offline.attr,
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
Index: linux/include/linux/cpumask.h
===================================================================
--- linux.orig/include/linux/cpumask.h	2014-05-12 12:43:49.373311788 -0500
+++ linux/include/linux/cpumask.h	2014-05-12 12:52:12.191784100 -0500
@@ -80,6 +80,7 @@
 extern const struct cpumask *const cpu_online_mask;
 extern const struct cpumask *const cpu_present_mask;
 extern const struct cpumask *const cpu_active_mask;
+extern const struct cpumask *const cpu_stat_off;

 #if NR_CPUS > 1
 #define num_online_cpus()	cpumask_weight(cpu_online_mask)


* Re: vmstat: On demand vmstat workers V5
  2014-05-14 16:07   ` Christoph Lameter
@ 2014-05-14 23:15     ` Thomas Gleixner
  2014-05-27 20:07       ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: Thomas Gleixner @ 2014-05-14 23:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar, hpa, mingo, peterz

On Wed, 14 May 2014, Christoph Lameter wrote:
> - Shepherd thread as a general worker thread. This means
>   that the general mechanism to control worker thread
>   cpu proposed by Frederic Weisbecker is necessary to
>   restrict the shepherd thread to the cpus not used
>   for low latency tasks. Hopefully that is ready to be
>   merged soon. No need anymore to have a specific
>   cpu be the housekeeper cpu.

Amen to that.

Acked-by me for the general approach.

I don't want to give any unqualified opinion on the mm/vmstat parts of
this patch.

Thanks,

	tglx




* Re: vmstat: On demand vmstat workers V5
  2014-05-14 23:15     ` Thomas Gleixner
@ 2014-05-27 20:07       ` Christoph Lameter
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Lameter @ 2014-05-27 20:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andrew Morton, Gilad Ben-Yossef, Tejun Heo, John Stultz,
	Mike Frysinger, Minchan Kim, Hakan Akkan, Max Krasnyansky,
	Frederic Weisbecker, Paul E. McKenney, linux-kernel, linux-mm,
	hughd, viresh.kumar, hpa, mingo, peterz

On Thu, 15 May 2014, Thomas Gleixner wrote:

> Acked-by me for the general approach.
>
> I don't want to give any unqualified opinion on the mm/vmstat parts of
> this patch.

Thanks. Any other comments? Could we get this into -next?



* Re: vmstat: On demand vmstat workers V5
  2014-05-12 18:18 vmstat: On demand vmstat workers V5 Christoph Lameter
  2014-05-13 15:24 ` Thomas Gleixner
@ 2014-05-28 15:21 ` Frederic Weisbecker
  2014-05-28 16:19   ` Christoph Lameter
  1 sibling, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2014-05-28 15:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, linux-kernel, linux-mm, hughd,
	viresh.kumar, hpa, mingo, peterz

On Mon, May 12, 2014 at 01:18:10PM -0500, Christoph Lameter wrote:
>  #ifdef CONFIG_SMP
>  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
>  int sysctl_stat_interval __read_mostly = HZ;
> +static DECLARE_BITMAP(cpu_stat_off_bits, CONFIG_NR_CPUS) __read_mostly;
> +const struct cpumask *const cpu_stat_off = to_cpumask(cpu_stat_off_bits);
> +EXPORT_SYMBOL(cpu_stat_off);

Is there no way to make it a cpumask_var_t, and allocate it from
start_shepherd_timer()?

This should really take less space overall.
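
A minimal sketch of that variant (an assumption, not a posted patch; the
space saving materializes with CONFIG_CPUMASK_OFFSTACK):

	static cpumask_var_t cpu_stat_off;

	static void __init start_shepherd_timer(void)
	{
		/* ... per-cpu worker init as in the patch ... */

		if (!alloc_cpumask_var(&cpu_stat_off, GFP_KERNEL))
			BUG();
		cpumask_copy(cpu_stat_off, cpu_online_mask);

		/* ... schedule the shepherd as in the patch ... */
	}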

> +
> +/* We need to write to cpu_stat_off here */
> +#define stat_off to_cpumask(cpu_stat_off_bits)
> 
>  static void vmstat_update(struct work_struct *w)
>  {
> +	if (refresh_cpu_vm_stats())
> +		/*
> +		 * Counters were updated so we expect more updates
> +		 * to occur in the future. Keep on running the
> +		 * update worker thread.
> +		 */
> +		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
> +			round_jiffies_relative(sysctl_stat_interval));
> +	else {
> +		/*
> +		 * We did not update any counters so the app may be in
> +		 * a mode where it does not cause counter updates.
> +		 * We may be uselessly running vmstat_update.
> +		 * Defer the checking for differentials to the
> +		 * shepherd thread on a different processor.
> +		 */
> +		int r;
> +		/*
> +		 * Housekeeping cpu does not race since it never
> +		 * changes the bit if it is zero
> +		 */
> +		r = cpumask_test_and_set_cpu(smp_processor_id(),
> +			stat_off);
> +		VM_BUG_ON(r);
> +	}
> +}
> +
> +/*
> + * Check if the diffs for a certain cpu indicate that
> + * an update is needed.
> + */
> +static bool need_update(int cpu)
> +{
> +	struct zone *zone;
> +
> +	for_each_populated_zone(zone) {
> +		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
> +
> +		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
> +		/*
> +		 * The fast way of checking if there are any vmstat diffs.
> +		 * This works because the diffs are byte sized items.
> +		 */
> +		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
> +			return true;
> +
> +	}
> +	return false;
> +}
> +
> +
> +/*
> + * Shepherd worker thread that updates the statistics for the
> + * processor the shepherd worker is running on and checks the
> + * differentials of other processors that have their worker
> + * threads for vm statistics updates disabled because of
> + * inactivity.
> + */
> +static void vmstat_shepherd(struct work_struct *w)
> +{
> +	int cpu;
> +
>  	refresh_cpu_vm_stats();
> -	schedule_delayed_work(&__get_cpu_var(vmstat_work),
> -		round_jiffies_relative(sysctl_stat_interval));
> +
> +	/* Check processors whose vmstat worker threads have been disabled */
> +	for_each_cpu(cpu, stat_off)
> +		if (need_update(cpu) &&
> +			cpumask_test_and_clear_cpu(cpu, stat_off)) {
> +
> +			struct delayed_work *work = &per_cpu(vmstat_work, cpu);
> +
> +			INIT_DEFERRABLE_WORK(work, vmstat_update);
> +			schedule_delayed_work_on(cpu, work,
> +				__round_jiffies_relative(sysctl_stat_interval,
> +				cpu));
> +		}
> +
> +	schedule_delayed_work(this_cpu_ptr(&vmstat_work),
> +		__round_jiffies_relative(sysctl_stat_interval,
> +		HOUSEKEEPING_CPU));

Maybe you can just make the shepherd work unbound and let it be bound from userspace
once we have the workqueue user affinity patchset in.
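
As a sketch (assuming such a patchset, and a dedicated shepherd
delayed_work as in the V6 version quoted earlier in the thread):

	queue_delayed_work(system_unbound_wq, &shepherd,
			   round_jiffies_relative(sysctl_stat_interval));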

OTOH, it means you need to have a vmstat_update work on the housekeeping CPU as well.
But that's perhaps what you want since the vmstat_shepherd feature is probably not
something you want to enable without a full dynticks CPU around. It probably adds quite
some overhead on normal workloads to do a system-wide scan.

But having two works scheduled for the whole system is perhaps some overhead as well.


* Re: vmstat: On demand vmstat workers V5
  2014-05-28 15:21 ` Frederic Weisbecker
@ 2014-05-28 16:19   ` Christoph Lameter
  2014-05-29  0:36     ` Frederic Weisbecker
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2014-05-28 16:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, linux-kernel, linux-mm, hughd,
	viresh.kumar, hpa, mingo, peterz

On Wed, 28 May 2014, Frederic Weisbecker wrote:

> On Mon, May 12, 2014 at 01:18:10PM -0500, Christoph Lameter wrote:
> >  #ifdef CONFIG_SMP
> >  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
> >  int sysctl_stat_interval __read_mostly = HZ;
> > +static DECLARE_BITMAP(cpu_stat_off_bits, CONFIG_NR_CPUS) __read_mostly;
> > +const struct cpumask *const cpu_stat_off = to_cpumask(cpu_stat_off_bits);
> > +EXPORT_SYMBOL(cpu_stat_off);
>
> Is there no way to make it a cpumask_var_t, and allocate it from
> start_shepherd_timer()?
>
> This should really take less space overall.

This was taken from the way things work with the other cpumasks in
linux/kernel/cpu.c. It's compatible with the way it is done there and
also allows write protection of the cpumask outside of vmstat.c.

> > +	schedule_delayed_work(this_cpu_ptr(&vmstat_work),
> > +		__round_jiffies_relative(sysctl_stat_interval,
> > +		HOUSEKEEPING_CPU));
>
> Maybe you can just make the shepherd work unbound and let it be bound from userspace
> once we have the workqueue user affinity patchset in.

Yes that is what V5 should have done. Looks like the final version was not
posted. Sigh. The correct patch follows this message and it no longer uses
HOUSEKEEPING_CPU.


> OTOH, it means you need to have a vmstat_update work on the housekeeping CPU as well.

Well the vmstat_update may not be needed on the processor where the
shepherd runs, so it may save something.

From cl@linux.com Thu Oct  3 12:41:21 2013
Date: Thu, 3 Oct 2013 12:41:21 -0500 (CDT)
From: Christoph Lameter <cl@linux.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>, Thomas Gleixner <tglx@linutronix.de>, Tejun Heo <tj@kernel.org>, John Stultz <johnstul@us.ibm.com>, Mike Frysinger <vapier@gentoo.org>, Minchan Kim <minchan.kim@gmail.com>, Hakan Akkan <hakanakkan@gmail.com>, Max Krasnyansky <maxk@qualcomm.com>, Frederic Weisbecker <fweisbec@gmail.com>, Paul E. McKenney <paulmck@linux.vnet.ibm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hughd@google.com, viresh.kumar@linaro.org, hpa@zytor.com, mingo@kernel.org, peterz@infradead.org
Subject: vmstat: On demand vmstat workers V6


V5->V6:
- Shepherd thread as a general worker thread. This means
  that the general mechanism to control worker thread
  cpu placement by Frederic Weisbecker is necessary to
  restrict the shepherd thread to the cpus not used
  for low latency tasks. Hopefully that is ready to be
  merged soon. No need anymore to have a specific
  cpu be the housekeeper cpu.

V4->V5:
- Shepherd thread on a specific cpu (HOUSEKEEPING_CPU).
- Incorporate Andrew's feedback
- Work out the races.
- Make visible which CPUs have stat updates switched off
  in /sys/devices/system/cpu/stat_off

V3->V4:
- Make the shepherd task not deferrable. It runs on the tick cpu
  anyways. Deferral could get deltas too far out of sync if
  vmstat operations are disabled for a certain processor.

V2->V3:
- Introduce a new tick_get_housekeeping_cpu() function. Not sure
  if that is exactly what we want but it is a start. Thomas?
- Migrate the shepherd task if the output of
  tick_get_housekeeping_cpu() changes.
- Fixes recommended by Andrew.

V1->V2:
- Optimize the need_update check by using memchr_inv.
- Clean up.

vmstat workers are used for folding counter differentials into the
zone, per node and global counters at certain time intervals.
They currently run at defined intervals on all processors, which causes
some holdoff for processors that need minimal intrusion by the
OS.

The current vmstat_update mechanism depends on a deferrable timer
firing every other second by default which registers a work queue item
that runs on the local CPU, with the result that we have 1 interrupt
and one additional schedulable task on each CPU every 2 seconds.
If a workload indeed causes VM activity or multiple tasks are running
on a CPU, then there are probably bigger issues to deal with.

However, some workloads dedicate a CPU for a single CPU bound task.
This is done in high performance computing, in high frequency
financial applications, in networking (Intel DPDK, EZchip NPS) and with
the advent of systems with more and more CPUs over time, this may become
more and more common to do: when one has enough CPUs,
one cares less about efficiently sharing a CPU with other tasks and
more about efficiently monopolizing a CPU per task.

The difference made by this timer firing and a workqueue kernel thread
being scheduled every second can be enormous. An artificial test measuring
the worst-case time to do a simple "i++" in an endless loop on a bare metal
system and under Linux on an isolated CPU with dynticks, with and
without this patch, shows Linux matching the bare metal performance
(~700 cycles) with this patch and losing by a couple of orders of magnitude
(~200k cycles) without it[*].  The loss occurs for something that just
calculates statistics. For networking applications, for example, this
could be the difference between dropping packets or sustaining line rate.

Statistics are important and useful, but it would be great if there
were a way to keep statistics gathering from producing a huge
performance difference. This patch does just that.

This patch creates a vmstat shepherd worker that monitors the
per cpu differentials on all processors. If there are differentials
on a processor then a vmstat worker local to the processor
with the differentials is created. That worker will then start
folding the diffs at regular intervals. Should the worker
find that there is no work to be done then it will make the shepherd
worker monitor the differentials again.

With this patch it is then possible to have periods longer than
2 seconds without any OS event on a "cpu" (hardware thread).

Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Signed-off-by: Christoph Lameter <cl@linux.com>


Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c	2014-05-15 10:45:02.133488061 -0500
+++ linux/mm/vmstat.c	2014-05-15 10:45:02.129488144 -0500
@@ -7,6 +7,7 @@
  *  zoned VM statistics
  *  Copyright (C) 2006 Silicon Graphics, Inc.,
  *		Christoph Lameter <christoph@lameter.com>
+ *  Copyright (C) 2008-2014 Christoph Lameter
  */
 #include <linux/fs.h>
 #include <linux/mm.h>
@@ -14,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/cpu.h>
+#include <linux/cpumask.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
 #include <linux/math64.h>
@@ -417,13 +419,22 @@
 EXPORT_SYMBOL(dec_zone_page_state);
 #endif

-static inline void fold_diff(int *diff)
+
+/*
+ * Fold a differential into the global counters.
+ * Returns the number of counters updated.
+ */
+static int fold_diff(int *diff)
 {
 	int i;
+	int changes = 0;

 	for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
-		if (diff[i])
+		if (diff[i]) {
 			atomic_long_add(diff[i], &vm_stat[i]);
+			changes++;
+	}
+	return changes;
 }

 /*
@@ -439,12 +450,15 @@
  * statistics in the remote zone struct as well as the global cachelines
  * with the global counters. These could cause remote node cache line
  * bouncing and will have to be only done when necessary.
+ *
+ * The function returns the number of global counters updated.
  */
-static void refresh_cpu_vm_stats(void)
+static int refresh_cpu_vm_stats(void)
 {
 	struct zone *zone;
 	int i;
 	int global_diff[NR_VM_ZONE_STAT_ITEMS] = { 0, };
+	int changes = 0;

 	for_each_populated_zone(zone) {
 		struct per_cpu_pageset __percpu *p = zone->pageset;
@@ -484,15 +498,17 @@
 			continue;
 		}

-
 		if (__this_cpu_dec_return(p->expire))
 			continue;

-		if (__this_cpu_read(p->pcp.count))
+		if (__this_cpu_read(p->pcp.count)) {
 			drain_zone_pages(zone, __this_cpu_ptr(&p->pcp));
+			changes++;
+		}
 #endif
 	}
-	fold_diff(global_diff);
+	changes += fold_diff(global_diff);
+	return changes;
 }

 /*
@@ -1222,20 +1238,109 @@
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
+static DECLARE_BITMAP(cpu_stat_off_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_stat_off = to_cpumask(cpu_stat_off_bits);
+EXPORT_SYMBOL(cpu_stat_off);
+
+/* We need to write to cpu_stat_off here */
+#define stat_off to_cpumask(cpu_stat_off_bits)

 static void vmstat_update(struct work_struct *w)
 {
-	refresh_cpu_vm_stats();
-	schedule_delayed_work(&__get_cpu_var(vmstat_work),
+	if (refresh_cpu_vm_stats())
+		/*
+		 * Counters were updated so we expect more updates
+		 * to occur in the future. Keep on running the
+		 * update worker thread.
+		 */
+		schedule_delayed_work(this_cpu_ptr(&vmstat_work),
+			round_jiffies_relative(sysctl_stat_interval));
+	else {
+		/*
+		 * We did not update any counters so the app may be in
+		 * a mode where it does not cause counter updates.
+		 * We may be uselessly running vmstat_update.
+		 * Defer the checking for differentials to the
+		 * shepherd thread on a different processor.
+		 */
+		int r;
+		/*
+		 * Shepherd work thread does not race since it never
+		 * changes the bit if it is zero but the cpu
+		 * online / off line code may race if
+		 * worker threads are still allowed during
+		 * shutdown / startup.
+		 */
+		r = cpumask_test_and_set_cpu(smp_processor_id(),
+			stat_off);
+		VM_BUG_ON(r);
+	}
+}
+
+/*
+ * Check if the diffs for a certain cpu indicate that
+ * an update is needed.
+ */
+static bool need_update(int cpu)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		struct per_cpu_pageset *p = per_cpu_ptr(zone->pageset, cpu);
+
+		BUILD_BUG_ON(sizeof(p->vm_stat_diff[0]) != 1);
+		/*
+		 * The fast way of checking if there are any vmstat diffs.
+		 * This works because the diffs are byte sized items.
+		 */
+		if (memchr_inv(p->vm_stat_diff, 0, NR_VM_ZONE_STAT_ITEMS))
+			return true;
+
+	}
+	return false;
+}
+
+
+/*
+ * Shepherd worker thread that checks the
+ * differentials of processors that have their worker
+ * threads for vm statistics updates disabled because of
+ * inactivity.
+ */
+static void vmstat_shepherd(struct work_struct *w);
+
+static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+
+static void vmstat_shepherd(struct work_struct *w)
+{
+	int cpu;
+
+	/* Check processors whose vmstat worker threads have been disabled */
+	for_each_cpu(cpu, stat_off)
+		if (need_update(cpu) &&
+			cpumask_test_and_clear_cpu(cpu, stat_off))
+
+			schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu),
+				__round_jiffies_relative(sysctl_stat_interval, cpu));
+
+
+	schedule_delayed_work(&shepherd,
 		round_jiffies_relative(sysctl_stat_interval));
+
 }

-static void start_cpu_timer(int cpu)
+static void __init start_shepherd_timer(void)
 {
-	struct delayed_work *work = &per_cpu(vmstat_work, cpu);
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu),
+			vmstat_update);
+
+	cpumask_copy(stat_off, cpu_online_mask);

-	INIT_DEFERRABLE_WORK(work, vmstat_update);
-	schedule_delayed_work_on(cpu, work, __round_jiffies_relative(HZ, cpu));
+	schedule_delayed_work(&shepherd,
+		round_jiffies_relative(sysctl_stat_interval));
 }

 static void vmstat_cpu_dead(int node)
@@ -1266,17 +1371,17 @@
 	case CPU_ONLINE:
 	case CPU_ONLINE_FROZEN:
 		refresh_zone_stat_thresholds();
-		start_cpu_timer(cpu);
 		node_set_state(cpu_to_node(cpu), N_CPU);
+		cpumask_set_cpu(cpu, stat_off);
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
-		cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
-		per_cpu(vmstat_work, cpu).work.func = NULL;
+		if (!cpumask_test_and_set_cpu(cpu, stat_off))
+			cancel_delayed_work_sync(&per_cpu(vmstat_work, cpu));
 		break;
 	case CPU_DOWN_FAILED:
 	case CPU_DOWN_FAILED_FROZEN:
-		start_cpu_timer(cpu);
+		cpumask_set_cpu(cpu, stat_off);
 		break;
 	case CPU_DEAD:
 	case CPU_DEAD_FROZEN:
@@ -1296,15 +1401,10 @@
 static int __init setup_vmstat(void)
 {
 #ifdef CONFIG_SMP
-	int cpu;
-
 	cpu_notifier_register_begin();
 	__register_cpu_notifier(&vmstat_notifier);

-	for_each_online_cpu(cpu) {
-		start_cpu_timer(cpu);
-		node_set_state(cpu_to_node(cpu), N_CPU);
-	}
+	start_shepherd_timer();
 	cpu_notifier_register_done();
 #endif
 #ifdef CONFIG_PROC_FS
Index: linux/drivers/base/cpu.c
===================================================================
--- linux.orig/drivers/base/cpu.c	2014-05-15 10:45:02.133488061 -0500
+++ linux/drivers/base/cpu.c	2014-05-15 10:45:02.129488144 -0500
@@ -222,6 +222,7 @@
 	_CPU_ATTR(online, &cpu_online_mask),
 	_CPU_ATTR(possible, &cpu_possible_mask),
 	_CPU_ATTR(present, &cpu_present_mask),
+	_CPU_ATTR(stat_off, &cpu_stat_off),
 };

 /*
@@ -378,6 +379,7 @@
 	&cpu_attrs[0].attr.attr,
 	&cpu_attrs[1].attr.attr,
 	&cpu_attrs[2].attr.attr,
+	&cpu_attrs[3].attr.attr,
 	&dev_attr_kernel_max.attr,
 	&dev_attr_offline.attr,
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
Index: linux/include/linux/cpumask.h
===================================================================
--- linux.orig/include/linux/cpumask.h	2014-05-15 10:45:02.133488061 -0500
+++ linux/include/linux/cpumask.h	2014-05-15 10:45:02.129488144 -0500
@@ -80,6 +80,7 @@
 extern const struct cpumask *const cpu_online_mask;
 extern const struct cpumask *const cpu_present_mask;
 extern const struct cpumask *const cpu_active_mask;
+extern const struct cpumask *const cpu_stat_off;

 #if NR_CPUS > 1
 #define num_online_cpus()	cpumask_weight(cpu_online_mask)


* Re: vmstat: On demand vmstat workers V5
  2014-05-28 16:19   ` Christoph Lameter
@ 2014-05-29  0:36     ` Frederic Weisbecker
  2014-05-29 14:07       ` Christoph Lameter
  0 siblings, 1 reply; 13+ messages in thread
From: Frederic Weisbecker @ 2014-05-29  0:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, linux-kernel, linux-mm, hughd,
	viresh.kumar, hpa, mingo, peterz

On Wed, May 28, 2014 at 11:19:49AM -0500, Christoph Lameter wrote:
> On Wed, 28 May 2014, Frederic Weisbecker wrote:
> 
> > On Mon, May 12, 2014 at 01:18:10PM -0500, Christoph Lameter wrote:
> > >  #ifdef CONFIG_SMP
> > >  static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
> > >  int sysctl_stat_interval __read_mostly = HZ;
> > > +static DECLARE_BITMAP(cpu_stat_off_bits, CONFIG_NR_CPUS) __read_mostly;
> > > +const struct cpumask *const cpu_stat_off = to_cpumask(cpu_stat_off_bits);
> > > +EXPORT_SYMBOL(cpu_stat_off);
> >
> > Is there no way to make it a cpumask_var_t, and allocate it from
> > start_shepherd_timer()?
> >
> > This should really take less space overall.
> 
> This was taken from the way things work with the other cpumasks in
> linux/kernel/cpu.c. It's compatible with the way it is done there and
> also allows write protection of the cpumask outside of vmstat.c.

The cpumasks in cpu.c are special as they are the base of the cpumask_var_t
definition. They are necessary to define nr_cpu_bits which is the base of
cpumask_var_t allocations. As such they must stay lower level and defined
on top of NR_CPUS.

But most other cases don't need that huge static bitmap. I actually haven't
seen any other struct cpumask that isn't based on cpumask_var_t.

> 
> > > +	schedule_delayed_work(this_cpu_ptr(&vmstat_work),
> > > +		__round_jiffies_relative(sysctl_stat_interval,
> > > +		HOUSEKEEPING_CPU));
> >
> > Maybe you can just make the shepherd work unbound and let it be bound from userspace
> > once we have the workqueue user affinity patchset in.
> 
> Yes that is what V5 should have done. Looks like the final version was not
> posted. Sigh. The correct patch follows this message and it no longer uses
> HOUSEKEEPING_CPU.

Ok.

> 
> 
> > OTOH, it means you need to have a vmstat_update work on the housekeeping CPU as well.
> 
> Well the vmstat_update may not be needed on the processor where the
> shepherd runs, so it may save something.

Ok, thanks!

> 
> From cl@linux.com Thu Oct  3 12:41:21 2013
> Date: Thu, 3 Oct 2013 12:41:21 -0500 (CDT)
> From: Christoph Lameter <cl@linux.com>
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: Gilad Ben-Yossef <gilad@benyossef.com>, Thomas Gleixner <tglx@linutronix.de>, Tejun Heo <tj@kernel.org>, John Stultz <johnstul@us.ibm.com>, Mike Frysinger <vapier@gentoo.org>, Minchan Kim <minchan.kim@gmail.com>, Hakan Akkan <hakanakkan@gmail.com>, Max Krasnyansky <maxk@qualcomm.com>, Frederic Weisbecker <fweisbec@gmail.com>, Paul E. McKenney <paulmck@linux.vnet.ibm.com>, linux-kernel@vger.kernel.org, linux-mm@kvack.org, hughd@google.com, viresh.kumar@linaro.org, hpa@zytor.com, mingo@kernel.org, peterz@infradead.org
> Subject: vmstat: On demand vmstat workers V6

Please post it on a new thread so it gets noticed by others.

Thanks.


* Re: vmstat: On demand vmstat workers V5
  2014-05-29  0:36     ` Frederic Weisbecker
@ 2014-05-29 14:07       ` Christoph Lameter
  2014-05-29 14:26         ` Frederic Weisbecker
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2014-05-29 14:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, linux-kernel, linux-mm, hughd,
	viresh.kumar, hpa, mingo, peterz

On Thu, 29 May 2014, Frederic Weisbecker wrote:

> The cpumasks in cpu.c are special as they are the base of the cpumask_var_t
> definition. They are necessary to define nr_cpu_bits which is the base of
> cpumask_var_t allocations. As such they must stay lower level and defined
> on top of NR_CPUS.
>
> But most other cases don't need that huge static bitmap. I actually haven't
> seen any other struct cpumask that isn't based on cpumask_var_t.

Well yes, and I am tying directly into that scheme there in cpu.c to
display the active vmstat threads in sysfs, so it's the same.

> Please post it on a new thread so it gets noticed by others.

Ok. Will do when we got agreement on the cpumask issue.

I would like to have some way to display the activities on cpus in /sysfs
like I have done here with the active vmstat workers.

What I think we need is to display cpumasks for

1. Cpus where the tick is currently off
2. Cpus that have dynticks enabled.
3. Cpus that are idle
4. Cpus that are used for RCU.




* Re: vmstat: On demand vmstat workers V5
  2014-05-29 14:07       ` Christoph Lameter
@ 2014-05-29 14:26         ` Frederic Weisbecker
  2014-05-29 16:24           ` Christoph Lameter
  2014-05-29 16:29           ` Paul E. McKenney
  0 siblings, 2 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2014-05-29 14:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, linux-kernel, linux-mm, hughd,
	viresh.kumar, hpa, mingo, peterz

On Thu, May 29, 2014 at 09:07:44AM -0500, Christoph Lameter wrote:
> On Thu, 29 May 2014, Frederic Weisbecker wrote:
> 
> > The cpumasks in cpu.c are special as they are the base of the cpumask_var_t
> > definition. They are necessary to define nr_cpu_bits which is the base of
> > cpumask_var_t allocations. As such they must stay lower level and defined
> > on top of NR_CPUS.
> >
> > But most other cases don't need that huge static bitmap. I actually haven't
> > seen any other struct cpumask that isn't based on cpumask_var_t.
> 
> Well yes, and I am tying directly into that scheme there in cpu.c to
> display the active vmstat threads in sysfs, so it's the same.

I don't think so. Or is there something in vmstat that cpumask_var_t
definition depends upon?

> 
> > Please post it on a new thread so it gets noticed by others.
> 
> Ok. Will do when we got agreement on the cpumask issue.
> 
> I would like to have some way to display the activities on cpus in /sysfs
> like I have done here with the active vmstat workers.
> 
> What I think we need is to display cpumasks for
> 
> 1. Cpus where the tick is currently off
> 2. Cpus that have dynticks enabled.
> 3. Cpus that are idle

You should find all that in /proc/timer_list

Now for CPUs that have full dynticks enabled, we probably need something
in sysfs. We could dump the nohz cpumask somewhere. For now you can only grep
the dmesg.

> 4. Cpus that are used for RCU.

So, you mean those that aren't in an extended grace period (between rcu_user_enter()/exit
or rcu_idle_enter/exit)?

Paul?


* Re: vmstat: On demand vmstat workers V5
  2014-05-29 14:26         ` Frederic Weisbecker
@ 2014-05-29 16:24           ` Christoph Lameter
  2014-05-29 16:40             ` Paul E. McKenney
  2014-05-29 16:29           ` Paul E. McKenney
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Lameter @ 2014-05-29 16:24 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, Gilad Ben-Yossef, Thomas Gleixner, Tejun Heo,
	John Stultz, Mike Frysinger, Minchan Kim, Hakan Akkan,
	Max Krasnyansky, Paul E. McKenney, linux-kernel, linux-mm, hughd,
	viresh.kumar, hpa, mingo, peterz

On Thu, 29 May 2014, Frederic Weisbecker wrote:

> > Well yes and I am tying directly into that scheme there in cpu.c to
> > display the active vmstat threads in sysfs. so its the same.
>
> I don't think so. Or is there something in vmstat that cpumask_var_t
> definition depends upon?

This patch definitely ties the vmstat cpumask into the scheme in cpu.c

> > I would like to have some way to display the activities on cpus in /sysfs
> > like I have done here with the active vmstat workers.
> >
> > What I think we need is to display cpumasks for
> >
> > 1. Cpus where the tick is currently off
> > 2. Cpus that have dynticks enabled.
> > 3. Cpus that are idle
>
> You should find all that in /proc/timer_list

True. I could actually drop the vmstat cpumask support.

> Now for CPUs that have full dynticks enabled, we probably need something
> in sysfs. We could dump the nohz cpumask somewhere. For now you can only grep
> the dmesg

There is a nohz mode in /proc/timer_list, right?

> > 4. Cpus that are used for RCU.
>
> So, you mean those that aren't in an extended grace period (between rcu_user_enter()/exit
> or rcu_idle_enter/exit)?

No I mean cpus that have their RCU processing directed to another
processor.



* Re: vmstat: On demand vmstat workers V5
  2014-05-29 14:26         ` Frederic Weisbecker
  2014-05-29 16:24           ` Christoph Lameter
@ 2014-05-29 16:29           ` Paul E. McKenney
  1 sibling, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2014-05-29 16:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Christoph Lameter, Andrew Morton, Gilad Ben-Yossef,
	Thomas Gleixner, Tejun Heo, John Stultz, Mike Frysinger,
	Minchan Kim, Hakan Akkan, Max Krasnyansky, linux-kernel,
	linux-mm, hughd, viresh.kumar, hpa, mingo, peterz

On Thu, May 29, 2014 at 04:26:05PM +0200, Frederic Weisbecker wrote:
> On Thu, May 29, 2014 at 09:07:44AM -0500, Christoph Lameter wrote:
> > On Thu, 29 May 2014, Frederic Weisbecker wrote:
> > 
> > > The cpumasks in cpu.c are special as they are the base of the cpumask_var_t
> > > definition. They are necessary to define nr_cpu_bits which is the base of
> > > cpumask_var_t allocations. As such they must stay lower level and defined
> > > on top of NR_CPUS.
> > >
> > > But most other cases don't need that huge static bitmap. I actually haven't
> > > seen any other struct cpumask that isn't based on cpumask_var_t.
> > 
> > Well yes, and I am tying directly into that scheme there in cpu.c to
> > display the active vmstat threads in sysfs, so it's the same.
> 
> I don't think so. Or is there something in vmstat that cpumask_var_t
> definition depends upon?
> 
> > 
> > > Please post it on a new thread so it gets noticed by others.
> > 
> > Ok. Will do when we got agreement on the cpumask issue.
> > 
> > I would like to have some way to display the activities on cpus in /sysfs
> > like I have done here with the active vmstat workers.
> > 
> > What I think we need is to display cpumasks for
> > 
> > 1. Cpus where the tick is currently off
> > 2. Cpus that have dynticks enabled.
> > 3. Cpus that are idle
> 
> You should find all that in /proc/timer_list
> 
> Now for CPUs that have full dynticks enabled, we probably need something
> in sysfs. We could dump the nohz cpumask somewhere. For now you can only grep
> the dmesg
> 
> > 4. Cpus that are used for RCU.
> 
> So, you mean those that aren't in an extended grace period (between rcu_user_enter()/exit
> or rcu_idle_enter/exit)?
> 
> Paul?

We are clearly going to have to be very careful to avoid cache thrashing,
so methods that update a CPU mask on each transition, as the saying goes,
"need not apply".

So we need a function like __rcu_is_watching(), but that takes the
CPU number as an argument.  Something like the following:

include/linux/rcutree.h:

	bool rcu_is_watching_cpu(int cpu);

kernel/rcu/tree.c:

	bool rcu_is_watching_cpu(int cpu)
	{
		return atomic_read(&per_cpu(rcu_dynticks.dynticks, cpu)) & 0x1;
	}

include/linux/rcutiny.h:

	static inline bool rcu_is_watching_cpu(int cpu)
	{
		return true;
	}

This could then be invoked from the appropriate sysfs or /proc setup.
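
For instance, a hypothetical show routine (attribute name and placement
made up for illustration):

	static ssize_t rcu_watching_show(struct device *dev,
					 struct device_attribute *attr,
					 char *buf)
	{
		ssize_t n = 0;
		int cpu;

		for_each_online_cpu(cpu)
			if (rcu_is_watching_cpu(cpu))
				n += scnprintf(buf + n, PAGE_SIZE - n,
					       "%d ", cpu);
		n += scnprintf(buf + n, PAGE_SIZE - n, "\n");
		return n;
	}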

							Thanx, Paul



* Re: vmstat: On demand vmstat workers V5
  2014-05-29 16:24           ` Christoph Lameter
@ 2014-05-29 16:40             ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2014-05-29 16:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Frederic Weisbecker, Andrew Morton, Gilad Ben-Yossef,
	Thomas Gleixner, Tejun Heo, John Stultz, Mike Frysinger,
	Minchan Kim, Hakan Akkan, Max Krasnyansky, linux-kernel,
	linux-mm, hughd, viresh.kumar, hpa, mingo, peterz

On Thu, May 29, 2014 at 11:24:15AM -0500, Christoph Lameter wrote:
> On Thu, 29 May 2014, Frederic Weisbecker wrote:
> 
> > > Well yes and I am tying directly into that scheme there in cpu.c to
> > > display the active vmstat threads in sysfs. so its the same.
> >
> > I don't think so. Or is there something in vmstat that cpumask_var_t
> > definition depends upon?
> 
> This patch definitely ties the vmstat cpumask into the scheme in cpu.c
> 
> > > I would like to have some way to display the activities on cpus in /sysfs
> > > like I have done here with the active vmstat workers.
> > >
> > > What I think we need is display cpumasks for
> > >
> > > 1. Cpus where the tick is currently off
> > > 2. Cpus that have dynticks enabled.
> > > 3. Cpus that are idle
> >
> > You should find all that in /proc/timer_list
> 
> True. I could actually drop the vmstat cpumask support.
> 
> > Now for CPUs that have full dynticks enabled, we probably need something
> > in sysfs. We could dump the nohz cpumask somewhere. For now you can only grep
> > the dmesg
> 
> There is a nohz mode in /proc/timer_list right?
> 
> > > 4. Cpus that are used for RCU.
> >
> > So, you mean those that aren't in an extended grace period (between rcu_user_enter()/exit
> > or rcu_idle_enter/exit)?
> 
> No I mean cpus that have their RCU processing directed to another
> processor.

Ah, that is easier!

In kernel/rcu/tree_plugin.c under #ifdef CONFIG_RCU_NOCB_CPU:

cpumask_var_t get_rcu_nocb_mask(void)
{
	return rcu_nocb_mask;
}


In include/linux/rcupdate.h:

#if defined(CONFIG_TINY_RCU) || !defined(CONFIG_RCU_NOCB_CPU)
static inline cpumask_var_t get_rcu_nocb_mask(void)
{
	return NULL;
}
#else
cpumask_var_t get_rcu_nocb_mask(void);
#endif


Then display the mask however you prefer.  Modifying the mask is a very
bad idea, and will void your warranty, etc., etc.
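
A hypothetical display-only consumer, hooked up e.g. via single_open()
(names made up for illustration):

	static int rcu_nocb_show(struct seq_file *m, void *v)
	{
		const struct cpumask *mask = get_rcu_nocb_mask();
		int cpu;

		if (mask)
			for_each_cpu(cpu, mask)
				seq_printf(m, "%d ", cpu);
		seq_putc(m, '\n');
		return 0;
	}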

							Thanx, Paul


