* [PATCH v3] mm, sysctl: make NUMA stats configurable
@ 2017-09-28  6:11 ` Kemi Wang
  0 siblings, 0 replies; 48+ messages in thread
From: Kemi Wang @ 2017-09-28  6:11 UTC (permalink / raw)
  To: Luis R . Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet,
	Michal Hocko, Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka
  Cc: Dave, Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Kemi Wang, Proc sysctl, Linux MM, Linux Kernel

This is the second step, which introduces a tunable interface that makes NUMA
stats configurable in order to optimize zone_statistics(), as suggested by
Dave Hansen and Ying Huang.

=========================================================================
When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do:
	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
In this case, numa counter updates are ignored. We see about a *4.8%*
(185->176) drop in CPU cycles per single page allocation and reclaim on
Jesper's page_bench01 (single thread) and an *8.1%* (343->315) drop in CPU
cycles per single page allocation and reclaim on Jesper's page_bench03 (88
threads), running on a 2-socket Broadwell-based server (88 threads, 126G
memory).

Benchmark link provided by Jesper D Brouer (with loop count increased to
10000000):
https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench

=========================================================================
When page allocation performance is not a bottleneck and you want all
tooling to work, you can do:
	echo [S|s]trict > /proc/sys/vm/numa_stats_mode

=========================================================================
We recommend letting the system detect numa statistics automatically; this is
also the system default configuration. You can do:
	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
In this case, numa counter updates are skipped unless the counters have been
read by users at least once, e.g. cat /proc/zoneinfo.
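
For illustration only (these commands are examples and not part of the patch),
a typical auto-mode session could look like:
	cat /proc/sys/vm/numa_stats_mode              # reports "auto" by default
	cat /proc/zoneinfo > /dev/null                # first read enables counter updates
	cat /sys/devices/system/node/node0/numastat   # allocations from now on are counted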

Branch target selection with jump label:
a) When numa_stats_mode is changed to *strict*, jump to the branch that
updates the numa counters.
b) When numa_stats_mode is changed to *coarse*, return directly.
c) When numa_stats_mode is changed to *auto*, the branch target used last
time is kept, and it is switched to the counter-update branch once the numa
counters are *read* by users.

Therefore, with the help of the jump label, page allocation performance is
hardly affected when numa counters are updated via the call in
zone_statistics(). Meanwhile, the auto mode gives people this benefit without
manual tuning.
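
To make the control flow easier to follow before reading the full diff, below
is a condensed sketch of how the static key ties the pieces together. It is
assembled from the hunks in this patch and is not an additional change; the
helper name numa_stats_consumed() is made up here for illustration, the patch
open-codes that check at each read site.

/* Sketch only -- condensed from the mm/page_alloc.c, mm/vmstat.c and
 * drivers/base/node.c hunks below. */
#include <linux/static_key.h>

/* The key starts disabled: auto mode skips counter updates until the
 * first read. */
DEFINE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);

/* Hot path in the page allocator: a single patched branch, no data load. */
static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
{
	if (!static_branch_unlikely(&vm_numa_stats_mode_key))
		return;	/* coarse mode, or auto mode before the first read */
	/* ... NUMA_HIT/NUMA_MISS/NUMA_LOCAL updates as before ... */
}

/* Read side (/proc/zoneinfo, /proc/vmstat, sysfs numastat): the first read
 * in auto mode flips the key on, so later allocations are counted.
 * (Hypothetical helper; the patch open-codes this at each read site.) */
static void numa_stats_consumed(void)
{
	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
		static_branch_enable(&vm_numa_stats_mode_key);
}

/* Sysctl write side: "strict" enables the key unconditionally, "coarse"
 * disables it and zeroes the now-stale counters via
 * invalid_numa_statistics(). */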

Many thanks to Michal Hocko, Dave Hansen and Ying Huang for their comments,
which helped improve the original patch.

ChangeLog:
  V2->V3:
  a) Propose a better way to use jump label to eliminate the overhead of
  branch selection in zone_statistics(), as inspired by Ying Huang;
  b) Add a paragraph in commit log to describe the way for branch target
  selection;
  c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
  and change the description accordingly, as suggested by Michal Hocko;
  d) Make this functionality NUMA-specific via ifdef

  V1->V2:
  a) Merge to one patch;
  b) Use jump label to eliminate the overhead of branch selection;
  c) Add a single-time log message at boot time to help tell users what
  happened.

Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Kemi Wang <kemi.wang@intel.com>
---
 Documentation/sysctl/vm.txt |  24 +++++++++
 drivers/base/node.c         |   4 ++
 include/linux/vmstat.h      |  23 ++++++++
 init/main.c                 |   3 ++
 kernel/sysctl.c             |   7 +++
 mm/page_alloc.c             |  10 ++++
 mm/vmstat.c                 | 129 ++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 200 insertions(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 9baf66a..e310e69 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- numa_stats_mode
 - watermark_scale_factor
 - zone_reclaim_mode
 
@@ -843,6 +844,29 @@ ten times more freeable objects than there are.
 
 =============================================================
 
+numa_stats_mode
+
+This interface allows numa statistics configurable.
+
+When page allocation performance becomes a bottleneck and you can tolerate
+some possible tool breakage and decreased numa counter precision, you can
+do:
+	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
+
+When page allocation performance is not a bottleneck and you want all
+tooling to work, you can do:
+	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
+
+We recommend automatic detection of numa statistics by system, because numa
+statistics does not affect system's decision and it is very rarely
+consumed. you can do:
+	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
+This is also system default configuration, with this default setting, numa
+counters update is skipped unless the counter is *read* by users at least
+once.
+
+==============================================================
+
 watermark_scale_factor:
 
 This factor controls the aggressiveness of kswapd. It defines the
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 3855902..b57b5622 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
 static ssize_t node_read_numastat(struct device *dev,
 				struct device_attribute *attr, char *buf)
 {
+	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
+		static_branch_enable(&vm_numa_stats_mode_key);
 	return sprintf(buf,
 		       "numa_hit %lu\n"
 		       "numa_miss %lu\n"
@@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
 		n += sprintf(buf+n, "%s %lu\n",
 			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
 			     sum_zone_numa_state(nid, i));
+	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
+		static_branch_enable(&vm_numa_stats_mode_key);
 #endif
 
 	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ade7cb5..d52e882 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -6,9 +6,28 @@
 #include <linux/mmzone.h>
 #include <linux/vm_event_item.h>
 #include <linux/atomic.h>
+#include <linux/static_key.h>
 
 extern int sysctl_stat_interval;
 
+#ifdef CONFIG_NUMA
+DECLARE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
+/*
+ * vm_numa_stats_mode:
+ * 0 = auto mode of NUMA stats, automatic detection of NUMA statistics.
+ * 1 = strict mode of NUMA stats, keep NUMA statistics.
+ * 2 = coarse mode of NUMA stats, ignore NUMA statistics.
+ */
+#define VM_NUMA_STAT_AUTO_MODE 0
+#define VM_NUMA_STAT_STRICT_MODE  1
+#define VM_NUMA_STAT_COARSE_MODE  2
+#define VM_NUMA_STAT_MODE_LEN 16
+extern int vm_numa_stats_mode;
+extern char sysctl_vm_numa_stats_mode[];
+extern int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos);
+#endif
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 /*
  * Light weight per cpu counter implementation.
@@ -229,6 +248,10 @@ extern unsigned long sum_zone_node_page_state(int node,
 extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
 extern unsigned long node_page_state(struct pglist_data *pgdat,
 						enum node_stat_item item);
+extern void zero_zone_numa_counters(struct zone *zone);
+extern void zero_zones_numa_counters(void);
+extern void zero_global_numa_counters(void);
+extern void invalid_numa_statistics(void);
 #else
 #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
 #define node_page_state(node, item) global_node_page_state(item)
diff --git a/init/main.c b/init/main.c
index 0ee9c686..1e300a8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -567,6 +567,9 @@ asmlinkage __visible void __init start_kernel(void)
 	sort_main_extable();
 	trap_init();
 	mm_init();
+#ifdef CONFIG_NUMA
+	pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
+#endif
 
 	ftrace_init();
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..0678668 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1374,6 +1374,13 @@ static struct ctl_table vm_table[] = {
 		.mode           = 0644,
 		.proc_handler   = &hugetlb_mempolicy_sysctl_handler,
 	},
+	{
+		.procname	= "numa_stats_mode",
+		.data		= sysctl_vm_numa_stats_mode,
+		.maxlen		= VM_NUMA_STAT_MODE_LEN,
+		.mode		= 0644,
+		.proc_handler	= sysctl_vm_numa_stats_mode_handler,
+	},
 #endif
 	 {
 		.procname	= "hugetlb_shm_group",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af8..6d7ea18 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -83,6 +83,8 @@ DEFINE_PER_CPU(int, numa_node);
 EXPORT_PER_CPU_SYMBOL(numa_node);
 #endif
 
+DEFINE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
+
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
 /*
  * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
@@ -2743,6 +2745,14 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 #ifdef CONFIG_NUMA
 	enum numa_stat_item local_stat = NUMA_LOCAL;
 
+	/*
+	 * skip zone_statistics() if NUMA stats is set to coarse mode or
+	 * NUMA stats is never consumed in auto mode.
+	 */
+
+	if (!static_branch_unlikely(&vm_numa_stats_mode_key))
+		return;
+
 	if (z->node != numa_node_id())
 		local_stat = NUMA_OTHER;
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4bb13e7..469599c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -32,6 +32,91 @@
 
 #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
 
+#ifdef CONFIG_NUMA
+int vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
+char sysctl_vm_numa_stats_mode[VM_NUMA_STAT_MODE_LEN] = "auto";
+static const char *vm_numa_stats_mode_name[3] = {"auto", "strict", "coarse"};
+static DEFINE_MUTEX(vm_numa_stats_mode_lock);
+
+static int __parse_vm_numa_stats_mode(char *s)
+{
+	const char *str = s;
+
+	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
+		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
+	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
+		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
+	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
+		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
+	else {
+		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos)
+{
+	char old_string[VM_NUMA_STAT_MODE_LEN];
+	int ret, oldval;
+
+	mutex_lock(&vm_numa_stats_mode_lock);
+	if (write)
+		strncpy(old_string, (char *)table->data, VM_NUMA_STAT_MODE_LEN);
+	ret = proc_dostring(table, write, buffer, length, ppos);
+	if (ret || !write) {
+		mutex_unlock(&vm_numa_stats_mode_lock);
+		return ret;
+	}
+
+	oldval = vm_numa_stats_mode;
+	if (__parse_vm_numa_stats_mode((char *)table->data)) {
+		/*
+		 * invalid sysctl_vm_numa_stats_mode value, restore saved string
+		 */
+		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
+		vm_numa_stats_mode = oldval;
+	} else {
+		/*
+		 * check whether numa stats mode changes or not
+		 */
+		if (vm_numa_stats_mode == oldval) {
+			/* no change */
+			mutex_unlock(&vm_numa_stats_mode_lock);
+			return 0;
+		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
+			/*
+			 * Keep the branch selection in last time when numa stats
+			 * is changed to auto mode.
+			 */
+			pr_info("numa stats changes from %s mode to auto mode\n",
+					vm_numa_stats_mode_name[oldval]);
+		else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
+			static_branch_enable(&vm_numa_stats_mode_key);
+			pr_info("numa stats changes from %s mode to strict mode\n",
+					vm_numa_stats_mode_name[oldval]);
+		} else if (vm_numa_stats_mode == VM_NUMA_STAT_COARSE_MODE) {
+			static_branch_disable(&vm_numa_stats_mode_key);
+			/*
+			 * Invalidate numa counters when vmstat mode is set to coarse
+			 * mode, because users can't tell the difference between the
+			 * dead state and when allocator activity is quiet once
+			 * zone_statistics() is turned off.
+			 */
+			invalid_numa_statistics();
+			pr_info("numa stats changes from %s mode to coarse mode\n",
+					vm_numa_stats_mode_name[oldval]);
+		} else
+			pr_warn("invalid vm_numa_stats_mode:%d\n", vm_numa_stats_mode);
+	}
+
+	mutex_unlock(&vm_numa_stats_mode_lock);
+	return 0;
+}
+#endif
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
 EXPORT_PER_CPU_SYMBOL(vm_event_states);
@@ -914,6 +999,42 @@ unsigned long sum_zone_numa_state(int node,
 	return count;
 }
 
+/* zero numa counters within a zone */
+void zero_zone_numa_counters(struct zone *zone)
+{
+	int item, cpu;
+
+	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
+		atomic_long_set(&zone->vm_numa_stat[item], 0);
+		for_each_online_cpu(cpu)
+			per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] = 0;
+	}
+}
+
+/* zero numa counters of all the populated zones */
+void zero_zones_numa_counters(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		zero_zone_numa_counters(zone);
+}
+
+/* zero global numa counters */
+void zero_global_numa_counters(void)
+{
+	int item;
+
+	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
+		atomic_long_set(&vm_numa_stat[item], 0);
+}
+
+void invalid_numa_statistics(void)
+{
+	zero_zones_numa_counters();
+	zero_global_numa_counters();
+}
+
 /*
  * Determine the per node value of a stat item.
  */
@@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg)
 {
 	pg_data_t *pgdat = (pg_data_t *)arg;
 	walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
+#ifdef CONFIG_NUMA
+	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
+		static_branch_enable(&vm_numa_stats_mode_key);
+#endif
 	return 0;
 }
 
@@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
 
 static void vmstat_stop(struct seq_file *m, void *arg)
 {
+#ifdef CONFIG_NUMA
+	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
+		static_branch_enable(&vm_numa_stats_mode_key);
+#endif
 	kfree(m->private);
 	m->private = NULL;
 }
-- 
2.7.4


* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-09-28  6:11 ` Kemi Wang
@ 2017-09-28 21:29   ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2017-09-28 21:29 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Luis R . Rodriguez, Kees Cook, Jonathan Corbet, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel

On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang <kemi.wang@intel.com> wrote:

> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.

Looks OK I guess.

I fiddled with it a lot.  Please consider:

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-sysctl-make-numa-stats-configurable-fix

- tweak documentation

- move advisory message from start_kernel() into mm_init() (I'm not sure
  we really need this message)

- use strcasecmp() in __parse_vm_numa_stats_mode()

- clean up coding style and messages in sysctl_vm_numa_stats_mode_handler()

Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ying Huang <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/sysctl/vm.txt |   15 ++++++-------
 init/main.c                 |    6 ++---
 mm/vmstat.c                 |   39 +++++++++++++++-------------------
 3 files changed, 29 insertions(+), 31 deletions(-)

diff -puN Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix Documentation/sysctl/vm.txt
--- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix
+++ a/Documentation/sysctl/vm.txt
@@ -853,7 +853,7 @@ ten times more freeable objects than the
 
 numa_stats_mode
 
-This interface allows numa statistics configurable.
+This interface allows runtime configuration or numa statistics.
 
 When page allocation performance becomes a bottleneck and you can tolerate
 some possible tool breakage and decreased numa counter precision, you can
@@ -864,13 +864,14 @@ When page allocation performance is not
 tooling to work, you can do:
 	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
 
-We recommend automatic detection of numa statistics by system, because numa
-statistics does not affect system's decision and it is very rarely
-consumed. you can do:
+We recommend automatic detection of numa statistics by system, because
+numa statistics do not affect system decisions and it is very rarely
+consumed.  In this case you can do:
 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
-This is also system default configuration, with this default setting, numa
-counters update is skipped unless the counter is *read* by users at least
-once.
+
+This is the system default configuration.  With this default setting, numa
+counter updates are skipped until the counter is *read* by userspace at
+least once.
 
 ==============================================================
 
diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix drivers/base/node.c
diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix include/linux/vmstat.h
diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c
--- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix
+++ a/init/main.c
@@ -504,6 +504,9 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+#ifdef CONFIG_NUMA
+	pr_info("vmstat: NUMA stat updates are skipped unless they have been used\n");
+#endif
 }
 
 asmlinkage __visible void __init start_kernel(void)
@@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k
 	sort_main_extable();
 	trap_init();
 	mm_init();
-#ifdef CONFIG_NUMA
-	pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
-#endif
 
 	ftrace_init();
 
diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix kernel/sysctl.c
diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix mm/page_alloc.c
diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c
--- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix
+++ a/mm/vmstat.c
@@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l
 
 static int __parse_vm_numa_stats_mode(char *s)
 {
-	const char *str = s;
-
-	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
+	if (strcasecmp(s, "auto"))
 		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
-	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
+	else if (strcasecmp(s, "strict") == 0)
 		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
-	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
+	else if (strcasecmp(s, "coarse"))
 		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
 	else {
 		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
@@ -86,30 +84,29 @@ int sysctl_vm_numa_stats_mode_handler(st
 			/* no change */
 			mutex_unlock(&vm_numa_stats_mode_lock);
 			return 0;
-		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
+		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) {
 			/*
-			 * Keep the branch selection in last time when numa stats
-			 * is changed to auto mode.
+			 * Keep the branch selection in last time when numa
+			 * stats is changed to auto mode.
 			 */
-			pr_info("numa stats changes from %s mode to auto mode\n",
-					vm_numa_stats_mode_name[oldval]);
-		else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
+			pr_info("numa stats changed from %s to auto mode\n",
+				 vm_numa_stats_mode_name[oldval]);
+		} else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
 			static_branch_enable(&vm_numa_stats_mode_key);
-			pr_info("numa stats changes from %s mode to strict mode\n",
-					vm_numa_stats_mode_name[oldval]);
+			pr_info("numa stats changes from %s to strict mode\n",
+				 vm_numa_stats_mode_name[oldval]);
 		} else if (vm_numa_stats_mode == VM_NUMA_STAT_COARSE_MODE) {
 			static_branch_disable(&vm_numa_stats_mode_key);
 			/*
-			 * Invalidate numa counters when vmstat mode is set to coarse
-			 * mode, because users can't tell the difference between the
-			 * dead state and when allocator activity is quiet once
-			 * zone_statistics() is turned off.
+			 * Invalidate numa counters when vmstat mode is set to
+			 * coarse mode, because users can't tell the difference
+			 * between the dead state and when allocator activity is
+			 * quiet once zone_statistics() is turned off.
 			 */
 			invalid_numa_statistics();
-			pr_info("numa stats changes from %s mode to coarse mode\n",
-					vm_numa_stats_mode_name[oldval]);
-		} else
-			pr_warn("invalid vm_numa_stats_mode:%d\n", vm_numa_stats_mode);
+			pr_info("numa stats changes from %s to coarse mode\n",
+				 vm_numa_stats_mode_name[oldval]);
+		}
 	}
 
 	mutex_unlock(&vm_numa_stats_mode_lock);
_


* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-09-28 21:29   ` Andrew Morton
@ 2017-09-29  1:44     ` kemi
  -1 siblings, 0 replies; 48+ messages in thread
From: kemi @ 2017-09-29  1:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luis R . Rodriguez, Kees Cook, Jonathan Corbet, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel



On 2017-09-29 05:29, Andrew Morton wrote:
> On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang <kemi.wang@intel.com> wrote:
> 
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
> 
> Looks OK I guess.
> 
> I fiddled with it a lot.  Please consider:
> 

Thanks for your help to make it more graceful! I will be more careful next time.
There may be a typo in Documentation/sysctl/vm.txt, see comment below.

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm-sysctl-make-numa-stats-configurable-fix
> 
> - tweak documentation
> 
> - move advisory message from start_kernel() into mm_init() (I'm not sure
>   we really need this message)
> 
> - use strcasecmp() in __parse_vm_numa_stats_mode()
> 
> - clean up coding style and messages in sysctl_vm_numa_stats_mode_handler()
> 
> Cc: Aaron Lu <aaron.lu@intel.com>
> Cc: Andi Kleen <andi.kleen@intel.com>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kemi Wang <kemi.wang@intel.com>
> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  Documentation/sysctl/vm.txt |   15 ++++++-------
>  init/main.c                 |    6 ++---
>  mm/vmstat.c                 |   39 +++++++++++++++-------------------
>  3 files changed, 29 insertions(+), 31 deletions(-)
> 
> diff -puN Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/Documentation/sysctl/vm.txt
> @@ -853,7 +853,7 @@ ten times more freeable objects than the
>  
>  numa_stats_mode
>  
> -This interface allows numa statistics configurable.
> +This interface allows runtime configuration *or* numa statistics.
>  

typo? or->of/for?

>  When page allocation performance becomes a bottleneck and you can tolerate
>  some possible tool breakage and decreased numa counter precision, you can
> @@ -864,13 +864,14 @@ When page allocation performance is not
>  tooling to work, you can do:
>  	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>  
> -We recommend automatic detection of numa statistics by system, because numa
> -statistics does not affect system's decision and it is very rarely
> -consumed. you can do:
> +We recommend automatic detection of numa statistics by system, because
> +numa statistics do not affect system decisions and it is very rarely
> +consumed.  In this case you can do:
>  	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> -This is also system default configuration, with this default setting, numa
> -counters update is skipped unless the counter is *read* by users at least
> -once.
> +
> +This is the system default configuration.  With this default setting, numa
> +counter updates are skipped until the counter is *read* by userspace at
> +least once.
>  
>  ==============================================================
>  
> diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix drivers/base/node.c
> diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix include/linux/vmstat.h
> diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c
> --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/init/main.c
> @@ -504,6 +504,9 @@ static void __init mm_init(void)
>  	pgtable_init();
>  	vmalloc_init();
>  	ioremap_huge_init();
> +#ifdef CONFIG_NUMA
> +	pr_info("vmstat: NUMA stat updates are skipped unless they have been used\n");
> +#endif
>  }
>  
>  asmlinkage __visible void __init start_kernel(void)
> @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k
>  	sort_main_extable();
>  	trap_init();
>  	mm_init();
> -#ifdef CONFIG_NUMA
> -	pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
> -#endif
>  
>  	ftrace_init();
>  
> diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix kernel/sysctl.c
> diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix mm/page_alloc.c
> diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c
> --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/mm/vmstat.c
> @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l
>  
>  static int __parse_vm_numa_stats_mode(char *s)
>  {
> -	const char *str = s;
> -
> -	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
> +	if (strcasecmp(s, "auto"))
>  		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> -	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
> +	else if (strcasecmp(s, "strict") == 0)
>  		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
> -	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
> +	else if (strcasecmp(s, "coarse"))
>  		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
>  	else {
>  		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
> @@ -86,30 +84,29 @@ int sysctl_vm_numa_stats_mode_handler(st
>  			/* no change */
>  			mutex_unlock(&vm_numa_stats_mode_lock);
>  			return 0;
> -		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) {
>  			/*
> -			 * Keep the branch selection in last time when numa stats
> -			 * is changed to auto mode.
> +			 * Keep the branch selection in last time when numa
> +			 * stats is changed to auto mode.
>  			 */
> -			pr_info("numa stats changes from %s mode to auto mode\n",
> -					vm_numa_stats_mode_name[oldval]);
> -		else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
> +			pr_info("numa stats changed from %s to auto mode\n",
> +				 vm_numa_stats_mode_name[oldval]);
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
>  			static_branch_enable(&vm_numa_stats_mode_key);
> -			pr_info("numa stats changes from %s mode to strict mode\n",
> -					vm_numa_stats_mode_name[oldval]);
> +			pr_info("numa stats changes from %s to strict mode\n",
> +				 vm_numa_stats_mode_name[oldval]);
>  		} else if (vm_numa_stats_mode == VM_NUMA_STAT_COARSE_MODE) {
>  			static_branch_disable(&vm_numa_stats_mode_key);
>  			/*
> -			 * Invalidate numa counters when vmstat mode is set to coarse
> -			 * mode, because users can't tell the difference between the
> -			 * dead state and when allocator activity is quiet once
> -			 * zone_statistics() is turned off.
> +			 * Invalidate numa counters when vmstat mode is set to
> +			 * coarse mode, because users can't tell the difference
> +			 * between the dead state and when allocator activity is
> +			 * quiet once zone_statistics() is turned off.
>  			 */
>  			invalid_numa_statistics();
> -			pr_info("numa stats changes from %s mode to coarse mode\n",
> -					vm_numa_stats_mode_name[oldval]);
> -		} else
> -			pr_warn("invalid vm_numa_stats_mode:%d\n", vm_numa_stats_mode);
> +			pr_info("numa stats changes from %s to coarse mode\n",
> +				 vm_numa_stats_mode_name[oldval]);
> +		}
>  	}
>  
>  	mutex_unlock(&vm_numa_stats_mode_lock);
> _
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
@ 2017-09-29  1:44     ` kemi
  0 siblings, 0 replies; 48+ messages in thread
From: kemi @ 2017-09-29  1:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luis R . Rodriguez, Kees Cook, Jonathan Corbet, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel



On 2017年09月29日 05:29, Andrew Morton wrote:
> On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang <kemi.wang@intel.com> wrote:
> 
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
> 
> Looks OK I guess.
> 
> I fiddled with it a lot.  Please consider:
> 

Thanks for your help to make it more graceful! I will be more careful next time.
There may be a typo error in Documentation/sysctl/vm.txt, see comment below.

> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm-sysctl-make-numa-stats-configurable-fix
> 
> - tweak documentation
> 
> - move advisory message from start_kernel() into mm_init() (I'm not sure
>   we really need this message)
> 
> - use strcasecmp() in __parse_vm_numa_stats_mode()
> 
> - clean up coding style amd nessages in sysctl_vm_numa_stats_mode_handler()
> 
> Cc: Aaron Lu <aaron.lu@intel.com>
> Cc: Andi Kleen <andi.kleen@intel.com>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Kemi Wang <kemi.wang@intel.com>
> Cc: "Luis R . Rodriguez" <mcgrof@kernel.org>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Tim Chen <tim.c.chen@intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  Documentation/sysctl/vm.txt |   15 ++++++-------
>  init/main.c                 |    6 ++---
>  mm/vmstat.c                 |   39 +++++++++++++++-------------------
>  3 files changed, 29 insertions(+), 31 deletions(-)
> 
> diff -puN Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix Documentation/sysctl/vm.txt
> --- a/Documentation/sysctl/vm.txt~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/Documentation/sysctl/vm.txt
> @@ -853,7 +853,7 @@ ten times more freeable objects than the
>  
>  numa_stats_mode
>  
> -This interface allows numa statistics configurable.
> +This interface allows runtime configuration *or* numa statistics.
>  

typo? or->of/for?

>  When page allocation performance becomes a bottleneck and you can tolerate
>  some possible tool breakage and decreased numa counter precision, you can
> @@ -864,13 +864,14 @@ When page allocation performance is not
>  tooling to work, you can do:
>  	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>  
> -We recommend automatic detection of numa statistics by system, because numa
> -statistics does not affect system's decision and it is very rarely
> -consumed. you can do:
> +We recommend automatic detection of numa statistics by system, because
> +numa statistics do not affect system decisions and it is very rarely
> +consumed.  In this case you can do:
>  	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> -This is also system default configuration, with this default setting, numa
> -counters update is skipped unless the counter is *read* by users at least
> -once.
> +
> +This is the system default configuration.  With this default setting, numa
> +counter updates are skipped until the counter is *read* by userspace at
> +least once.
>  
>  ==============================================================
>  
> diff -puN drivers/base/node.c~mm-sysctl-make-numa-stats-configurable-fix drivers/base/node.c
> diff -puN include/linux/vmstat.h~mm-sysctl-make-numa-stats-configurable-fix include/linux/vmstat.h
> diff -puN init/main.c~mm-sysctl-make-numa-stats-configurable-fix init/main.c
> --- a/init/main.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/init/main.c
> @@ -504,6 +504,9 @@ static void __init mm_init(void)
>  	pgtable_init();
>  	vmalloc_init();
>  	ioremap_huge_init();
> +#ifdef CONFIG_NUMA
> +	pr_info("vmstat: NUMA stat updates are skipped unless they have been used\n");
> +#endif
>  }
>  
>  asmlinkage __visible void __init start_kernel(void)
> @@ -567,9 +570,6 @@ asmlinkage __visible void __init start_k
>  	sort_main_extable();
>  	trap_init();
>  	mm_init();
> -#ifdef CONFIG_NUMA
> -	pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
> -#endif
>  
>  	ftrace_init();
>  
> diff -puN kernel/sysctl.c~mm-sysctl-make-numa-stats-configurable-fix kernel/sysctl.c
> diff -puN mm/page_alloc.c~mm-sysctl-make-numa-stats-configurable-fix mm/page_alloc.c
> diff -puN mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix mm/vmstat.c
> --- a/mm/vmstat.c~mm-sysctl-make-numa-stats-configurable-fix
> +++ a/mm/vmstat.c
> @@ -40,13 +40,11 @@ static DEFINE_MUTEX(vm_numa_stats_mode_l
>  
>  static int __parse_vm_numa_stats_mode(char *s)
>  {
> -	const char *str = s;
> -
> -	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
> +	if (strcasecmp(s, "auto") == 0)
>  		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> -	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
> +	else if (strcasecmp(s, "strict") == 0)
>  		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
> -	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
> +	else if (strcasecmp(s, "coarse") == 0)
>  		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
>  	else {
>  		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
> @@ -86,30 +84,29 @@ int sysctl_vm_numa_stats_mode_handler(st
>  			/* no change */
>  			mutex_unlock(&vm_numa_stats_mode_lock);
>  			return 0;
> -		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE) {
>  			/*
> -			 * Keep the branch selection in last time when numa stats
> -			 * is changed to auto mode.
> +			 * Keep the branch selection in last time when numa
> +			 * stats is changed to auto mode.
>  			 */
> -			pr_info("numa stats changes from %s mode to auto mode\n",
> -					vm_numa_stats_mode_name[oldval]);
> -		else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
> +			pr_info("numa stats changed from %s to auto mode\n",
> +				 vm_numa_stats_mode_name[oldval]);
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
>  			static_branch_enable(&vm_numa_stats_mode_key);
> -			pr_info("numa stats changes from %s mode to strict mode\n",
> -					vm_numa_stats_mode_name[oldval]);
> +			pr_info("numa stats changes from %s to strict mode\n",
> +				 vm_numa_stats_mode_name[oldval]);
>  		} else if (vm_numa_stats_mode == VM_NUMA_STAT_COARSE_MODE) {
>  			static_branch_disable(&vm_numa_stats_mode_key);
>  			/*
> -			 * Invalidate numa counters when vmstat mode is set to coarse
> -			 * mode, because users can't tell the difference between the
> -			 * dead state and when allocator activity is quiet once
> -			 * zone_statistics() is turned off.
> +			 * Invalidate numa counters when vmstat mode is set to
> +			 * coarse mode, because users can't tell the difference
> +			 * between the dead state and when allocator activity is
> +			 * quiet once zone_statistics() is turned off.
>  			 */
>  			invalid_numa_statistics();
> -			pr_info("numa stats changes from %s mode to coarse mode\n",
> -					vm_numa_stats_mode_name[oldval]);
> -		} else
> -			pr_warn("invalid vm_numa_stats_mode:%d\n", vm_numa_stats_mode);
> +			pr_info("numa stats changes from %s to coarse mode\n",
> +				 vm_numa_stats_mode_name[oldval]);
> +		}
>  	}
>  
>  	mutex_unlock(&vm_numa_stats_mode_lock);
> _
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-09-28  6:11 ` Kemi Wang
@ 2017-09-29  7:03   ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-09-29  7:03 UTC (permalink / raw)
  To: Kemi Wang, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Michal Hocko, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior
  Cc: Dave, Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel

On 09/28/2017 08:11 AM, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =========================================================================
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 10000000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =========================================================================
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =========================================================================
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.
> 
> Branch target selection with jump label:
> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
> counters update.
> b) When numa_stats_mode is changed to *coarse*, return back directly.
> c) When numa_stats_mode is changed to *auto*, the branch target used in
> last time is kept, and the branch target is changed to the branch for numa
> counters update once numa counters are *read* by users.
> 
> Therefore, with the help of jump label, the page allocation performance is
> hardly affected when numa counters are updated with a call in
> zone_statistics(). Meanwhile, the auto mode can give people benefit without
> manual tuning.
> 
> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
> help improve the original patch.
> 
> ChangeLog:
>   V2->V3:
>   a) Propose a better way to use jump label to eliminate the overhead of
>   branch selection in zone_statistics(), as inspired by Ying Huang;
>   b) Add a paragraph in commit log to describe the way for branch target
>   selection;
>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>   and change the description accordingly, as suggested by Michal Hocko;
>   d) Make this functionality NUMA-specific via ifdef
> 
>   V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
> 
> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  Documentation/sysctl/vm.txt |  24 +++++++++
>  drivers/base/node.c         |   4 ++
>  include/linux/vmstat.h      |  23 ++++++++
>  init/main.c                 |   3 ++
>  kernel/sysctl.c             |   7 +++
>  mm/page_alloc.c             |  10 ++++
>  mm/vmstat.c                 | 129 ++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 200 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e310e69 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- numa_stats_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>  
>  =============================================================
>  
> +numa_stats_mode
> +
> +This interface allows numa statistics configurable.
> +
> +When page allocation performance becomes a bottleneck and you can tolerate
> +some possible tool breakage and decreased numa counter precision, you can
> +do:
> +	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> +
> +When page allocation performance is not a bottleneck and you want all
> +tooling to work, you can do:
> +	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
> +
> +We recommend automatic detection of numa statistics by system, because numa
> +statistics does not affect system's decision and it is very rarely
> +consumed. you can do:
> +	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> +This is also system default configuration, with this default setting, numa
> +counters update is skipped unless the counter is *read* by users at least
> +once.

It says "the counter", but it seems multiple files in /proc and /sys are
triggering this, so perhaps list them?
Also, is it possible that with contemporary userspace/distros (systemd
etc.) there will always be something that will read one of those upon boot?
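
For reference, the read paths in the patch that flip the key in auto mode
(an editor's summary of the diff; the sysfs paths are inferred from the
usual node device layout and are not spelled out in the patch itself):

	/*
	 * drivers/base/node.c: node_read_numastat() -> /sys/devices/system/node/node<N>/numastat
	 * drivers/base/node.c: node_read_vmstat()   -> /sys/devices/system/node/node<N>/vmstat
	 * mm/vmstat.c:         zoneinfo_show()      -> /proc/zoneinfo
	 * mm/vmstat.c:         vmstat_stop()        -> /proc/vmstat
	 */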

> +
> +==============================================================
> +
>  watermark_scale_factor:
>  
>  This factor controls the aggressiveness of kswapd. It defines the
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 3855902..b57b5622 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>  				struct device_attribute *attr, char *buf)
>  {
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  	return sprintf(buf,
>  		       "numa_hit %lu\n"
>  		       "numa_miss %lu\n"
> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n",
>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>  			     sum_zone_numa_state(nid, i));
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  #endif
>  
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ade7cb5..d52e882 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -6,9 +6,28 @@
>  #include <linux/mmzone.h>
>  #include <linux/vm_event_item.h>
>  #include <linux/atomic.h>
> +#include <linux/static_key.h>
>  
>  extern int sysctl_stat_interval;
>  
> +#ifdef CONFIG_NUMA
> +DECLARE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
> +/*
> + * vm_numa_stats_mode:
> + * 0 = auto mode of NUMA stats, automatic detection of NUMA statistics.
> + * 1 = strict mode of NUMA stats, keep NUMA statistics.
> + * 2 = coarse mode of NUMA stats, ignore NUMA statistics.
> + */
> +#define VM_NUMA_STAT_AUTO_MODE 0
> +#define VM_NUMA_STAT_STRICT_MODE  1
> +#define VM_NUMA_STAT_COARSE_MODE  2
> +#define VM_NUMA_STAT_MODE_LEN 16
> +extern int vm_numa_stats_mode;
> +extern char sysctl_vm_numa_stats_mode[];
> +extern int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *length, loff_t *ppos);
> +#endif
> +
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  /*
>   * Light weight per cpu counter implementation.
> @@ -229,6 +248,10 @@ extern unsigned long sum_zone_node_page_state(int node,
>  extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
>  extern unsigned long node_page_state(struct pglist_data *pgdat,
>  						enum node_stat_item item);
> +extern void zero_zone_numa_counters(struct zone *zone);
> +extern void zero_zones_numa_counters(void);
> +extern void zero_global_numa_counters(void);
> +extern void invalid_numa_statistics(void);

These seem to be called only from within mm/vmstat.c where they live, so
I'd suggest removing these extern declarations, and making them static
in vmstat.c.
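
A minimal sketch of that cleanup (editor's illustration, not part of the
posted patch), assuming the sysctl handler in mm/vmstat.c really is the
only caller: drop the four extern declarations from include/linux/vmstat.h
and keep the definitions file-local, with the bodies exactly as posted:

	/* mm/vmstat.c: zero numa counters within a zone */
	static void zero_zone_numa_counters(struct zone *zone)
	{
		int item, cpu;

		for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
			atomic_long_set(&zone->vm_numa_stat[item], 0);
			for_each_online_cpu(cpu)
				per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] = 0;
		}
	}

	/* likewise: static void zero_zones_numa_counters(void),
	 * static void zero_global_numa_counters(void) and
	 * static void invalid_numa_statistics(void) */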

...

>  #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
>  
> +#ifdef CONFIG_NUMA
> +int vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> +char sysctl_vm_numa_stats_mode[VM_NUMA_STAT_MODE_LEN] = "auto";
> +static const char *vm_numa_stats_mode_name[3] = {"auto", "strict", "coarse"};
> +static DEFINE_MUTEX(vm_numa_stats_mode_lock);
> +
> +static int __parse_vm_numa_stats_mode(char *s)
> +{
> +	const char *str = s;
> +
> +	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> +	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
> +	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
> +	else {
> +		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	char old_string[VM_NUMA_STAT_MODE_LEN];
> +	int ret, oldval;
> +
> +	mutex_lock(&vm_numa_stats_mode_lock);
> +	if (write)
> +		strncpy(old_string, (char *)table->data, VM_NUMA_STAT_MODE_LEN);
> +	ret = proc_dostring(table, write, buffer, length, ppos);
> +	if (ret || !write) {
> +		mutex_unlock(&vm_numa_stats_mode_lock);
> +		return ret;
> +	}
> +
> +	oldval = vm_numa_stats_mode;
> +	if (__parse_vm_numa_stats_mode((char *)table->data)) {
> +		/*
> +		 * invalid sysctl_vm_numa_stats_mode value, restore saved string
> +		 */
> +		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
> +		vm_numa_stats_mode = oldval;

Do we need to restore vm_numa_stats_mode? AFAICS it didn't change. Also,
should the -EINVAL also be returned to userspace? (not sure what the API
is here, hmm, man 2 sysctl doesn't list EINVAL...)
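
One possible shape of that error path (an editor's sketch, assuming it is
acceptable for the handler to propagate -EINVAL to the writer, which is
the open question above), with the redundant restore of vm_numa_stats_mode
dropped since __parse_vm_numa_stats_mode() leaves it untouched on failure:

	if (__parse_vm_numa_stats_mode((char *)table->data)) {
		/* invalid value written: put the old string back and
		 * report the failure to the writer */
		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
		mutex_unlock(&vm_numa_stats_mode_lock);
		return -EINVAL;
	}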

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-09-28 21:29   ` Andrew Morton
@ 2017-09-29  7:09     ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-09-29  7:09 UTC (permalink / raw)
  To: Andrew Morton, Kemi Wang
  Cc: Luis R . Rodriguez, Kees Cook, Jonathan Corbet, Michal Hocko,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Dave, Tim Chen, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Proc sysctl,
	Linux MM, Linux Kernel

On 09/28/2017 11:29 PM, Andrew Morton wrote:
> On Thu, 28 Sep 2017 14:11:41 +0800 Kemi Wang <kemi.wang@intel.com> wrote:
> 
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
> 
> Looks OK I guess.
> 
> I fiddled with it a lot.  Please consider:
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm-sysctl-make-numa-stats-configurable-fix
> 
> - tweak documentation
> 
> - move advisory message from start_kernel() into mm_init() (I'm not sure
>   we really need this message)

Actually, I'm not sure we need any of the current messages, or to have
them at higher priority than pr_debug()? They are all triggered by admin
action, or unconditionally upon boot.
OTOH I think that a useful message that's currently missing would be
when static_key_enable() is triggered in auto mode. Bonus points for
including the name of the process and the stat file that was read.
However, static_key_enable() returns void rather than reporting whether it
actually flipped the switch, so it's not trivial.
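
A rough sketch of what such a message could look like (editor's
illustration only: the helper name and the per-caller "file" argument are
made up, and static_key_enabled() is used as a best-effort check for
whether this particular read is the one that flips the key):

	/* called from the stat read paths when in auto mode */
	void vm_numa_stats_mode_consume(const char *file)
	{
		if (vm_numa_stats_mode != VM_NUMA_STAT_AUTO_MODE)
			return;
		if (!static_key_enabled(&vm_numa_stats_mode_key)) {
			static_branch_enable(&vm_numa_stats_mode_key);
			pr_info("vmstat: NUMA stats enabled (%s read by %s)\n",
				file, current->comm);
		}
	}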

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-09-28  6:11 ` Kemi Wang
  (?)
@ 2017-09-29  7:27   ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-09-29  7:27 UTC (permalink / raw)
  To: Kemi Wang, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Michal Hocko, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior
  Cc: Dave, Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel, Linux API

[+CC linux-api]

On 09/28/2017 08:11 AM, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =========================================================================
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 10000000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =========================================================================
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =========================================================================
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.
> 
> Branch target selection with jump label:
> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
> counters update.
> b) When numa_stats_mode is changed to *coarse*, return back directly.
> c) When numa_stats_mode is changed to *auto*, the branch target used in
> last time is kept, and the branch target is changed to the branch for numa
> counters update once numa counters are *read* by users.
> 
> Therefore, with the help of jump label, the page allocation performance is
> hardly affected when numa counters are updated with a call in
> zone_statistics(). Meanwhile, the auto mode can give people benefit without
> manual tuning.
> 
> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
> help improve the original patch.
> 
> ChangeLog:
>   V2->V3:
>   a) Propose a better way to use jump label to eliminate the overhead of
>   branch selection in zone_statistics(), as inspired by Ying Huang;
>   b) Add a paragraph in commit log to describe the way for branch target
>   selection;
>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>   and change the description accordingly, as suggested by Michal Hocko;
>   d) Make this functionality NUMA-specific via ifdef
> 
>   V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
> 
> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  Documentation/sysctl/vm.txt |  24 +++++++++
>  drivers/base/node.c         |   4 ++
>  include/linux/vmstat.h      |  23 ++++++++
>  init/main.c                 |   3 ++
>  kernel/sysctl.c             |   7 +++
>  mm/page_alloc.c             |  10 ++++
>  mm/vmstat.c                 | 129 ++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 200 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e310e69 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- numa_stats_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>  
>  =============================================================
>  
> +numa_stats_mode
> +
> +This interface allows numa statistics configurable.
> +
> +When page allocation performance becomes a bottleneck and you can tolerate
> +some possible tool breakage and decreased numa counter precision, you can
> +do:
> +	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> +
> +When page allocation performance is not a bottleneck and you want all
> +tooling to work, you can do:
> +	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
> +
> +We recommend automatic detection of numa statistics by system, because numa
> +statistics does not affect system's decision and it is very rarely
> +consumed. you can do:
> +	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> +This is also system default configuration, with this default setting, numa
> +counters update is skipped unless the counter is *read* by users at least
> +once.
> +
> +==============================================================
> +
>  watermark_scale_factor:
>  
>  This factor controls the aggressiveness of kswapd. It defines the
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 3855902..b57b5622 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>  				struct device_attribute *attr, char *buf)
>  {
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  	return sprintf(buf,
>  		       "numa_hit %lu\n"
>  		       "numa_miss %lu\n"
> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n",
>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>  			     sum_zone_numa_state(nid, i));
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  #endif
>  
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ade7cb5..d52e882 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -6,9 +6,28 @@
>  #include <linux/mmzone.h>
>  #include <linux/vm_event_item.h>
>  #include <linux/atomic.h>
> +#include <linux/static_key.h>
>  
>  extern int sysctl_stat_interval;
>  
> +#ifdef CONFIG_NUMA
> +DECLARE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
> +/*
> + * vm_numa_stats_mode:
> + * 0 = auto mode of NUMA stats, automatic detection of NUMA statistics.
> + * 1 = strict mode of NUMA stats, keep NUMA statistics.
> + * 2 = coarse mode of NUMA stats, ignore NUMA statistics.
> + */
> +#define VM_NUMA_STAT_AUTO_MODE 0
> +#define VM_NUMA_STAT_STRICT_MODE  1
> +#define VM_NUMA_STAT_COARSE_MODE  2
> +#define VM_NUMA_STAT_MODE_LEN 16
> +extern int vm_numa_stats_mode;
> +extern char sysctl_vm_numa_stats_mode[];
> +extern int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *length, loff_t *ppos);
> +#endif
> +
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  /*
>   * Light weight per cpu counter implementation.
> @@ -229,6 +248,10 @@ extern unsigned long sum_zone_node_page_state(int node,
>  extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
>  extern unsigned long node_page_state(struct pglist_data *pgdat,
>  						enum node_stat_item item);
> +extern void zero_zone_numa_counters(struct zone *zone);
> +extern void zero_zones_numa_counters(void);
> +extern void zero_global_numa_counters(void);
> +extern void invalid_numa_statistics(void);
>  #else
>  #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
>  #define node_page_state(node, item) global_node_page_state(item)
> diff --git a/init/main.c b/init/main.c
> index 0ee9c686..1e300a8 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -567,6 +567,9 @@ asmlinkage __visible void __init start_kernel(void)
>  	sort_main_extable();
>  	trap_init();
>  	mm_init();
> +#ifdef CONFIG_NUMA
> +	pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
> +#endif
>  
>  	ftrace_init();
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..0678668 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1374,6 +1374,13 @@ static struct ctl_table vm_table[] = {
>  		.mode           = 0644,
>  		.proc_handler   = &hugetlb_mempolicy_sysctl_handler,
>  	},
> +	{
> +		.procname	= "numa_stats_mode",
> +		.data		= sysctl_vm_numa_stats_mode,
> +		.maxlen		= VM_NUMA_STAT_MODE_LEN,
> +		.mode		= 0644,
> +		.proc_handler	= sysctl_vm_numa_stats_mode_handler,
> +	},
>  #endif
>  	 {
>  		.procname	= "hugetlb_shm_group",
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c841af8..6d7ea18 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -83,6 +83,8 @@ DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
>  #endif
>  
> +DEFINE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
> +
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
>  /*
>   * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
> @@ -2743,6 +2745,14 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
>  #ifdef CONFIG_NUMA
>  	enum numa_stat_item local_stat = NUMA_LOCAL;
>  
> +	/*
> +	 * skip zone_statistics() if NUMA stats is set to coarse mode or
> +	 * NUMA stats is never consumed in auto mode.
> +	 */
> +
> +	if (!static_branch_unlikely(&vm_numa_stats_mode_key))
> +		return;
> +
>  	if (z->node != numa_node_id())
>  		local_stat = NUMA_OTHER;
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4bb13e7..469599c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -32,6 +32,91 @@
>  
>  #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
>  
> +#ifdef CONFIG_NUMA
> +int vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> +char sysctl_vm_numa_stats_mode[VM_NUMA_STAT_MODE_LEN] = "auto";
> +static const char *vm_numa_stats_mode_name[3] = {"auto", "strict", "coarse"};
> +static DEFINE_MUTEX(vm_numa_stats_mode_lock);
> +
> +static int __parse_vm_numa_stats_mode(char *s)
> +{
> +	const char *str = s;
> +
> +	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> +	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
> +	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
> +	else {
> +		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	char old_string[VM_NUMA_STAT_MODE_LEN];
> +	int ret, oldval;
> +
> +	mutex_lock(&vm_numa_stats_mode_lock);
> +	if (write)
> +		strncpy(old_string, (char *)table->data, VM_NUMA_STAT_MODE_LEN);
> +	ret = proc_dostring(table, write, buffer, length, ppos);
> +	if (ret || !write) {
> +		mutex_unlock(&vm_numa_stats_mode_lock);
> +		return ret;
> +	}
> +
> +	oldval = vm_numa_stats_mode;
> +	if (__parse_vm_numa_stats_mode((char *)table->data)) {
> +		/*
> +		 * invalid sysctl_vm_numa_stats_mode value, restore saved string
> +		 */
> +		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
> +		vm_numa_stats_mode = oldval;
> +	} else {
> +		/*
> +		 * check whether numa stats mode changes or not
> +		 */
> +		if (vm_numa_stats_mode == oldval) {
> +			/* no change */
> +			mutex_unlock(&vm_numa_stats_mode_lock);
> +			return 0;
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +			/*
> +			 * Keep the branch selection in last time when numa stats
> +			 * is changed to auto mode.
> +			 */
> +			pr_info("numa stats changes from %s mode to auto mode\n",
> +					vm_numa_stats_mode_name[oldval]);
> +		else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
> +			static_branch_enable(&vm_numa_stats_mode_key);
> +			pr_info("numa stats changes from %s mode to strict mode\n",
> +					vm_numa_stats_mode_name[oldval]);
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_COARSE_MODE) {
> +			static_branch_disable(&vm_numa_stats_mode_key);
> +			/*
> +			 * Invalidate numa counters when vmstat mode is set to coarse
> +			 * mode, because users can't tell the difference between the
> +			 * dead state and when allocator activity is quiet once
> +			 * zone_statistics() is turned off.
> +			 */
> +			invalid_numa_statistics();
> +			pr_info("numa stats changes from %s mode to coarse mode\n",
> +					vm_numa_stats_mode_name[oldval]);
> +		} else
> +			pr_warn("invalid vm_numa_stats_mode:%d\n", vm_numa_stats_mode);
> +	}
> +
> +	mutex_unlock(&vm_numa_stats_mode_lock);
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
>  EXPORT_PER_CPU_SYMBOL(vm_event_states);
> @@ -914,6 +999,42 @@ unsigned long sum_zone_numa_state(int node,
>  	return count;
>  }
>  
> +/* zero numa counters within a zone */
> +void zero_zone_numa_counters(struct zone *zone)
> +{
> +	int item, cpu;
> +
> +	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
> +		atomic_long_set(&zone->vm_numa_stat[item], 0);
> +		for_each_online_cpu(cpu)
> +			per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] = 0;
> +	}
> +}
> +
> +/* zero numa counters of all the populated zones */
> +void zero_zones_numa_counters(void)
> +{
> +	struct zone *zone;
> +
> +	for_each_populated_zone(zone)
> +		zero_zone_numa_counters(zone);
> +}
> +
> +/* zero global numa counters */
> +void zero_global_numa_counters(void)
> +{
> +	int item;
> +
> +	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
> +		atomic_long_set(&vm_numa_stat[item], 0);
> +}
> +
> +void invalid_numa_statistics(void)
> +{
> +	zero_zones_numa_counters();
> +	zero_global_numa_counters();
> +}
> +
>  /*
>   * Determine the per node value of a stat item.
>   */
> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg)
>  {
>  	pg_data_t *pgdat = (pg_data_t *)arg;
>  	walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
> +#ifdef CONFIG_NUMA
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
> +#endif
>  	return 0;
>  }
>  
> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  
>  static void vmstat_stop(struct seq_file *m, void *arg)
>  {
> +#ifdef CONFIG_NUMA
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
> +#endif
>  	kfree(m->private);
>  	m->private = NULL;
>  }
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
@ 2017-09-29  7:27   ` Vlastimil Babka
  0 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-09-29  7:27 UTC (permalink / raw)
  To: Kemi Wang, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Michal Hocko, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior
  Cc: Dave, Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel, Linux API

[+CC linux-api]

On 09/28/2017 08:11 AM, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =========================================================================
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 10000000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =========================================================================
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =========================================================================
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.
> 
> Branch target selection with jump label:
> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
> counters update.
> b) When numa_stats_mode is changed to *coarse*, return back directly.
> c) When numa_stats_mode is changed to *auto*, the branch target used in
> last time is kept, and the branch target is changed to the branch for numa
> counters update once numa counters are *read* by users.
> 
> Therefore, with the help of jump label, the page allocation performance is
> hardly affected when numa counters are updated with a call in
> zone_statistics(). Meanwhile, the auto mode can give people benefit without
> manual tuning.
> 
> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
> help improve the original patch.
> 
> ChangeLog:
>   V2->V3:
>   a) Propose a better way to use jump label to eliminate the overhead of
>   branch selection in zone_statistics(), as inspired by Ying Huang;
>   b) Add a paragraph in commit log to describe the way for branch target
>   selection;
>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>   and change the description accordingly, as suggested by Michal Hocko;
>   d) Make this functionality NUMA-specific via ifdef
> 
>   V1->V2:
>   a) Merge to one patch;
>   b) Use jump label to eliminate the overhead of branch selection;
>   c) Add a single-time log message at boot time to help tell users what
>   happened.
> 
> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
> ---
>  Documentation/sysctl/vm.txt |  24 +++++++++
>  drivers/base/node.c         |   4 ++
>  include/linux/vmstat.h      |  23 ++++++++
>  init/main.c                 |   3 ++
>  kernel/sysctl.c             |   7 +++
>  mm/page_alloc.c             |  10 ++++
>  mm/vmstat.c                 | 129 ++++++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 200 insertions(+)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 9baf66a..e310e69 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>  - swappiness
>  - user_reserve_kbytes
>  - vfs_cache_pressure
> +- numa_stats_mode
>  - watermark_scale_factor
>  - zone_reclaim_mode
>  
> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>  
>  =============================================================
>  
> +numa_stats_mode
> +
> +This interface allows numa statistics configurable.
> +
> +When page allocation performance becomes a bottleneck and you can tolerate
> +some possible tool breakage and decreased numa counter precision, you can
> +do:
> +	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> +
> +When page allocation performance is not a bottleneck and you want all
> +tooling to work, you can do:
> +	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
> +
> +We recommend automatic detection of numa statistics by system, because numa
> +statistics does not affect system's decision and it is very rarely
> +consumed. you can do:
> +	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> +This is also system default configuration, with this default setting, numa
> +counters update is skipped unless the counter is *read* by users at least
> +once.
> +
> +==============================================================
> +
>  watermark_scale_factor:
>  
>  This factor controls the aggressiveness of kswapd. It defines the
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 3855902..b57b5622 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>  				struct device_attribute *attr, char *buf)
>  {
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  	return sprintf(buf,
>  		       "numa_hit %lu\n"
>  		       "numa_miss %lu\n"
> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n",
>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>  			     sum_zone_numa_state(nid, i));
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  #endif
>  
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ade7cb5..d52e882 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -6,9 +6,28 @@
>  #include <linux/mmzone.h>
>  #include <linux/vm_event_item.h>
>  #include <linux/atomic.h>
> +#include <linux/static_key.h>
>  
>  extern int sysctl_stat_interval;
>  
> +#ifdef CONFIG_NUMA
> +DECLARE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
> +/*
> + * vm_numa_stats_mode:
> + * 0 = auto mode of NUMA stats, automatic detection of NUMA statistics.
> + * 1 = strict mode of NUMA stats, keep NUMA statistics.
> + * 2 = coarse mode of NUMA stats, ignore NUMA statistics.
> + */
> +#define VM_NUMA_STAT_AUTO_MODE 0
> +#define VM_NUMA_STAT_STRICT_MODE  1
> +#define VM_NUMA_STAT_COARSE_MODE  2
> +#define VM_NUMA_STAT_MODE_LEN 16
> +extern int vm_numa_stats_mode;
> +extern char sysctl_vm_numa_stats_mode[];
> +extern int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *length, loff_t *ppos);
> +#endif
> +
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  /*
>   * Light weight per cpu counter implementation.
> @@ -229,6 +248,10 @@ extern unsigned long sum_zone_node_page_state(int node,
>  extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
>  extern unsigned long node_page_state(struct pglist_data *pgdat,
>  						enum node_stat_item item);
> +extern void zero_zone_numa_counters(struct zone *zone);
> +extern void zero_zones_numa_counters(void);
> +extern void zero_global_numa_counters(void);
> +extern void invalid_numa_statistics(void);
>  #else
>  #define sum_zone_node_page_state(node, item) global_zone_page_state(item)
>  #define node_page_state(node, item) global_node_page_state(item)
> diff --git a/init/main.c b/init/main.c
> index 0ee9c686..1e300a8 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -567,6 +567,9 @@ asmlinkage __visible void __init start_kernel(void)
>  	sort_main_extable();
>  	trap_init();
>  	mm_init();
> +#ifdef CONFIG_NUMA
> +	pr_info("vmstat: NUMA stats is skipped unless it has been consumed\n");
> +#endif
>  
>  	ftrace_init();
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..0678668 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1374,6 +1374,13 @@ static struct ctl_table vm_table[] = {
>  		.mode           = 0644,
>  		.proc_handler   = &hugetlb_mempolicy_sysctl_handler,
>  	},
> +	{
> +		.procname	= "numa_stats_mode",
> +		.data		= sysctl_vm_numa_stats_mode,
> +		.maxlen		= VM_NUMA_STAT_MODE_LEN,
> +		.mode		= 0644,
> +		.proc_handler	= sysctl_vm_numa_stats_mode_handler,
> +	},
>  #endif
>  	 {
>  		.procname	= "hugetlb_shm_group",
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c841af8..6d7ea18 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -83,6 +83,8 @@ DEFINE_PER_CPU(int, numa_node);
>  EXPORT_PER_CPU_SYMBOL(numa_node);
>  #endif
>  
> +DEFINE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
> +
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
>  /*
>   * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
> @@ -2743,6 +2745,14 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
>  #ifdef CONFIG_NUMA
>  	enum numa_stat_item local_stat = NUMA_LOCAL;
>  
> +	/*
> +	 * skip zone_statistics() if NUMA stats is set to coarse mode or
> +	 * NUMA stats is never consumed in auto mode.
> +	 */
> +
> +	if (!static_branch_unlikely(&vm_numa_stats_mode_key))
> +		return;
> +
>  	if (z->node != numa_node_id())
>  		local_stat = NUMA_OTHER;
>  
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4bb13e7..469599c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -32,6 +32,91 @@
>  
>  #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
>  
> +#ifdef CONFIG_NUMA
> +int vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> +char sysctl_vm_numa_stats_mode[VM_NUMA_STAT_MODE_LEN] = "auto";
> +static const char *vm_numa_stats_mode_name[3] = {"auto", "strict", "coarse"};
> +static DEFINE_MUTEX(vm_numa_stats_mode_lock);
> +
> +static int __parse_vm_numa_stats_mode(char *s)
> +{
> +	const char *str = s;
> +
> +	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
> +	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
> +	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
> +		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
> +	else {
> +		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
> +		void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	char old_string[VM_NUMA_STAT_MODE_LEN];
> +	int ret, oldval;
> +
> +	mutex_lock(&vm_numa_stats_mode_lock);
> +	if (write)
> +		strncpy(old_string, (char *)table->data, VM_NUMA_STAT_MODE_LEN);
> +	ret = proc_dostring(table, write, buffer, length, ppos);
> +	if (ret || !write) {
> +		mutex_unlock(&vm_numa_stats_mode_lock);
> +		return ret;
> +	}
> +
> +	oldval = vm_numa_stats_mode;
> +	if (__parse_vm_numa_stats_mode((char *)table->data)) {
> +		/*
> +		 * invalid sysctl_vm_numa_stats_mode value, restore saved string
> +		 */
> +		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
> +		vm_numa_stats_mode = oldval;
> +	} else {
> +		/*
> +		 * check whether numa stats mode changes or not
> +		 */
> +		if (vm_numa_stats_mode == oldval) {
> +			/* no change */
> +			mutex_unlock(&vm_numa_stats_mode_lock);
> +			return 0;
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +			/*
> +			 * Keep the branch selection in last time when numa stats
> +			 * is changed to auto mode.
> +			 */
> +			pr_info("numa stats changes from %s mode to auto mode\n",
> +					vm_numa_stats_mode_name[oldval]);
> +		else if (vm_numa_stats_mode == VM_NUMA_STAT_STRICT_MODE) {
> +			static_branch_enable(&vm_numa_stats_mode_key);
> +			pr_info("numa stats changes from %s mode to strict mode\n",
> +					vm_numa_stats_mode_name[oldval]);
> +		} else if (vm_numa_stats_mode == VM_NUMA_STAT_COARSE_MODE) {
> +			static_branch_disable(&vm_numa_stats_mode_key);
> +			/*
> +			 * Invalidate numa counters when vmstat mode is set to coarse
> +			 * mode, because users can't tell the difference between the
> +			 * dead state and when allocator activity is quiet once
> +			 * zone_statistics() is turned off.
> +			 */
> +			invalid_numa_statistics();
> +			pr_info("numa stats changes from %s mode to coarse mode\n",
> +					vm_numa_stats_mode_name[oldval]);
> +		} else
> +			pr_warn("invalid vm_numa_stats_mode:%d\n", vm_numa_stats_mode);
> +	}
> +
> +	mutex_unlock(&vm_numa_stats_mode_lock);
> +	return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>  DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
>  EXPORT_PER_CPU_SYMBOL(vm_event_states);
> @@ -914,6 +999,42 @@ unsigned long sum_zone_numa_state(int node,
>  	return count;
>  }
>  
> +/* zero numa counters within a zone */
> +void zero_zone_numa_counters(struct zone *zone)
> +{
> +	int item, cpu;
> +
> +	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++) {
> +		atomic_long_set(&zone->vm_numa_stat[item], 0);
> +		for_each_online_cpu(cpu)
> +			per_cpu_ptr(zone->pageset, cpu)->vm_numa_stat_diff[item] = 0;
> +	}
> +}
> +
> +/* zero numa counters of all the populated zones */
> +void zero_zones_numa_counters(void)
> +{
> +	struct zone *zone;
> +
> +	for_each_populated_zone(zone)
> +		zero_zone_numa_counters(zone);
> +}
> +
> +/* zero global numa counters */
> +void zero_global_numa_counters(void)
> +{
> +	int item;
> +
> +	for (item = 0; item < NR_VM_NUMA_STAT_ITEMS; item++)
> +		atomic_long_set(&vm_numa_stat[item], 0);
> +}
> +
> +void invalid_numa_statistics(void)
> +{
> +	zero_zones_numa_counters();
> +	zero_global_numa_counters();
> +}
> +
>  /*
>   * Determine the per node value of a stat item.
>   */
> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg)
>  {
>  	pg_data_t *pgdat = (pg_data_t *)arg;
>  	walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
> +#ifdef CONFIG_NUMA
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
> +#endif
>  	return 0;
>  }
>  
> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  
>  static void vmstat_stop(struct seq_file *m, void *arg)
>  {
> +#ifdef CONFIG_NUMA
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
> +#endif
>  	kfree(m->private);
>  	m->private = NULL;
>  }
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-09-28  6:11 ` Kemi Wang
@ 2017-10-03  9:23   ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-10-03  9:23 UTC (permalink / raw)
  To: Kemi Wang
  Cc: Luis R . Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel

On Thu 28-09-17 14:11:41, Kemi Wang wrote:
> This is the second step which introduces a tunable interface that allow
> numa stats configurable for optimizing zone_statistics(), as suggested by
> Dave Hansen and Ying Huang.
> 
> =========================================================================
> When page allocation performance becomes a bottleneck and you can tolerate
> some possible tool breakage and decreased numa counter precision, you can
> do:
> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is ignored. We can see about
> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> memory).
> 
> Benchmark link provided by Jesper D Brouer(increase loop times to
> 10000000):
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> bench
> 
> =========================================================================
> When page allocation performance is not a bottleneck and you want all
> tooling to work, you can do:
> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> 
> =========================================================================
> We recommend automatic detection of numa statistics by system, this is also
> system default configuration, you can do:
> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> In this case, numa counter update is skipped unless it has been read by
> users at least once, e.g. cat /proc/zoneinfo.

I am still not convinced that the auto mode is worth all the additional
code, or that it is a safe default to use. The whole thing could have been
a 0/1 toggle with simpler parsing and less code to catch readers.

E.g. why do we have to do static_branch_enable() on every read, or even in
vmstat_stop()? Wouldn't doing it once in open() be sufficient?
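
A rough sketch of what such a 0/1 toggle could look like, reusing the
static key from the patch. This is illustration only, not part of the
patch: sysctl_vm_numa_stat and the handler name below are made up, and the
ctl_table entry would additionally need .extra1/.extra2 pointing at
zero/one so proc_dointvec_minmax() clamps the range:

static int sysctl_vm_numa_stat = 1;	/* 1 = update NUMA counters, 0 = skip them */

static int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *length, loff_t *ppos)
{
	int oldval = sysctl_vm_numa_stat;
	int ret;

	ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
	if (ret || !write || sysctl_vm_numa_stat == oldval)
		return ret;

	/* flip the jump label only when the value actually changed */
	if (sysctl_vm_numa_stat)
		static_branch_enable(&vm_numa_stats_mode_key);
	else
		static_branch_disable(&vm_numa_stats_mode_key);

	return 0;
}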

> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>  static ssize_t node_read_numastat(struct device *dev,
>  				struct device_attribute *attr, char *buf)
>  {
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  	return sprintf(buf,
>  		       "numa_hit %lu\n"
>  		       "numa_miss %lu\n"
> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>  		n += sprintf(buf+n, "%s %lu\n",
>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>  			     sum_zone_numa_state(nid, i));
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
>  #endif
>  
>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
[...]
> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg)
>  {
>  	pg_data_t *pgdat = (pg_data_t *)arg;
>  	walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
> +#ifdef CONFIG_NUMA
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
> +#endif
>  	return 0;
>  }
>  
> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>  
>  static void vmstat_stop(struct seq_file *m, void *arg)
>  {
> +#ifdef CONFIG_NUMA
> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
> +		static_branch_enable(&vm_numa_stats_mode_key);
> +#endif
>  	kfree(m->private);
>  	m->private = NULL;
>  }
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-09-29  7:03   ` Vlastimil Babka
  (?)
@ 2017-10-09  2:20     ` kemi
  -1 siblings, 0 replies; 48+ messages in thread
From: kemi @ 2017-10-09  2:20 UTC (permalink / raw)
  To: Vlastimil Babka, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Michal Hocko, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior
  Cc: Dave, Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel



On 2017年09月29日 15:03, Vlastimil Babka wrote:
> On 09/28/2017 08:11 AM, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =========================================================================
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 10000000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =========================================================================
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =========================================================================
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
>>
>> Branch target selection with jump label:
>> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
>> counters update.
>> b) When numa_stats_mode is changed to *coarse*, return back directly.
>> c) When numa_stats_mode is changed to *auto*, the branch target used in
>> last time is kept, and the branch target is changed to the branch for numa
>> counters update once numa counters are *read* by users.
>>
>> Therefore, with the help of jump label, the page allocation performance is
>> hardly affected when numa counters are updated with a call in
>> zone_statistics(). Meanwhile, the auto mode can give people benefit without
>> manual tuning.
>>
>> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
>> help improve the original patch.
>>
>> ChangeLog:
>>   V2->V3:
>>   a) Propose a better way to use jump label to eliminate the overhead of
>>   branch selection in zone_statistics(), as inspired by Ying Huang;
>>   b) Add a paragraph in commit log to describe the way for branch target
>>   selection;
>>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>>   and change the description accordingly, as suggested by Michal Hocko;
>>   d) Make this functionality NUMA-specific via ifdef
>>
>>   V1->V2:
>>   a) Merge to one patch;
>>   b) Use jump label to eliminate the overhead of branch selection;
>>   c) Add a single-time log message at boot time to help tell users what
>>   happened.
>>
>> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>> Suggested-by: Ying Huang <ying.huang@intel.com>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> ---
>>  Documentation/sysctl/vm.txt |  24 +++++++++
>>  drivers/base/node.c         |   4 ++
>>  include/linux/vmstat.h      |  23 ++++++++
>>  init/main.c                 |   3 ++
>>  kernel/sysctl.c             |   7 +++
>>  mm/page_alloc.c             |  10 ++++
>>  mm/vmstat.c                 | 129 ++++++++++++++++++++++++++++++++++++++++++++
>>  7 files changed, 200 insertions(+)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 9baf66a..e310e69 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>>  - swappiness
>>  - user_reserve_kbytes
>>  - vfs_cache_pressure
>> +- numa_stats_mode
>>  - watermark_scale_factor
>>  - zone_reclaim_mode
>>  
>> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>>  
>>  =============================================================
>>  
>> +numa_stats_mode
>> +
>> +This interface allows numa statistics configurable.
>> +
>> +When page allocation performance becomes a bottleneck and you can tolerate
>> +some possible tool breakage and decreased numa counter precision, you can
>> +do:
>> +	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> +
>> +When page allocation performance is not a bottleneck and you want all
>> +tooling to work, you can do:
>> +	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>> +
>> +We recommend automatic detection of numa statistics by system, because numa
>> +statistics does not affect system's decision and it is very rarely
>> +consumed. you can do:
>> +	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> +This is also system default configuration, with this default setting, numa
>> +counters update is skipped unless the counter is *read* by users at least
>> +once.
> 
> It says "the counter", but it seems multiple files in /proc and /sys are
> triggering this, so perhaps list them?

Exactly, it is triggered from four files:
/proc/zoneinfo
/proc/vmstat
/sys/devices/system/node/node*/vmstat
/sys/devices/system/node/node*/numastat
Well, I am not sure it is worth listing them all in the documentation.

> Also, is it possible that with contemporary userspace/distros (systemd
> etc.) there will always be something that will read one of those upon boot?
> 
It depends on the tools used in userspace. If some tool really reads one of
those files, the active state in auto mode will be triggered.

>> +
>> +==============================================================
>> +
>>  watermark_scale_factor:
>>  
>>  This factor controls the aggressiveness of kswapd. It defines the
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 3855902..b57b5622 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>>  static ssize_t node_read_numastat(struct device *dev,
>>  				struct device_attribute *attr, char *buf)
>>  {
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>>  	return sprintf(buf,
>>  		       "numa_hit %lu\n"
>>  		       "numa_miss %lu\n"
>> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  		n += sprintf(buf+n, "%s %lu\n",
>>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>  			     sum_zone_numa_state(nid, i));
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>>  #endif
>>  
>>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index ade7cb5..d52e882 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -6,9 +6,28 @@
>>  #include <linux/mmzone.h>
>>  #include <linux/vm_event_item.h>
>>  #include <linux/atomic.h>
>> +#include <linux/static_key.h>
>>  
>>  extern int sysctl_stat_interval;
>>  
>> +#ifdef CONFIG_NUMA
>> +DECLARE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
>> +/*
>> + * vm_numa_stats_mode:
>> + * 0 = auto mode of NUMA stats, automatic detection of NUMA statistics.
>> + * 1 = strict mode of NUMA stats, keep NUMA statistics.
>> + * 2 = coarse mode of NUMA stats, ignore NUMA statistics.
>> + */
>> +#define VM_NUMA_STAT_AUTO_MODE 0
>> +#define VM_NUMA_STAT_STRICT_MODE  1
>> +#define VM_NUMA_STAT_COARSE_MODE  2
>> +#define VM_NUMA_STAT_MODE_LEN 16
>> +extern int vm_numa_stats_mode;
>> +extern char sysctl_vm_numa_stats_mode[];
>> +extern int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
>> +		void __user *buffer, size_t *length, loff_t *ppos);
>> +#endif
>> +
>>  #ifdef CONFIG_VM_EVENT_COUNTERS
>>  /*
>>   * Light weight per cpu counter implementation.
>> @@ -229,6 +248,10 @@ extern unsigned long sum_zone_node_page_state(int node,
>>  extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
>>  extern unsigned long node_page_state(struct pglist_data *pgdat,
>>  						enum node_stat_item item);
>> +extern void zero_zone_numa_counters(struct zone *zone);
>> +extern void zero_zones_numa_counters(void);
>> +extern void zero_global_numa_counters(void);
>> +extern void invalid_numa_statistics(void);
> 
> These seem to be called only from within mm/vmstat.c where they live, so
> I'd suggest removing these extern declarations, and making them static
> in vmstat.c.
> 

Agree. Thanks for catching it.

> ...
> 
>>  #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
>>  
>> +#ifdef CONFIG_NUMA
>> +int vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
>> +char sysctl_vm_numa_stats_mode[VM_NUMA_STAT_MODE_LEN] = "auto";
>> +static const char *vm_numa_stats_mode_name[3] = {"auto", "strict", "coarse"};
>> +static DEFINE_MUTEX(vm_numa_stats_mode_lock);
>> +
>> +static int __parse_vm_numa_stats_mode(char *s)
>> +{
>> +	const char *str = s;
>> +
>> +	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
>> +		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
>> +	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
>> +		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
>> +	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
>> +		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
>> +	else {
>> +		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
>> +		void __user *buffer, size_t *length, loff_t *ppos)
>> +{
>> +	char old_string[VM_NUMA_STAT_MODE_LEN];
>> +	int ret, oldval;
>> +
>> +	mutex_lock(&vm_numa_stats_mode_lock);
>> +	if (write)
>> +		strncpy(old_string, (char *)table->data, VM_NUMA_STAT_MODE_LEN);
>> +	ret = proc_dostring(table, write, buffer, length, ppos);
>> +	if (ret || !write) {
>> +		mutex_unlock(&vm_numa_stats_mode_lock);
>> +		return ret;
>> +	}
>> +
>> +	oldval = vm_numa_stats_mode;
>> +	if (__parse_vm_numa_stats_mode((char *)table->data)) {
>> +		/*
>> +		 * invalid sysctl_vm_numa_stats_mode value, restore saved string
>> +		 */
>> +		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
>> +		vm_numa_stats_mode = oldval;
> 
> Do we need to restore vm_numa_stats_mode? 

Not necessary. 

> AFAICS it didn't change. Also,
> should the EINVAL be returned also to userspace? (not sure what's the
> API here, hmm man 2 sysctl doesn't list EINVAL...)
> 

I don't think so. __parse_vm_numa_stats_mode() is only called from the
sysctl handler, and its non-zero return is only used there to restore the
saved string.
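
For completeness, if the EINVAL were meant to reach userspace, the error
branch of sysctl_vm_numa_stats_mode_handler() quoted above would only need
to return the error after restoring the string. A sketch of that
alternative, not what the patch does:

	ret = __parse_vm_numa_stats_mode((char *)table->data);
	if (ret) {
		/* restore the saved string, then let -EINVAL reach the writer */
		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
		mutex_unlock(&vm_numa_stats_mode_lock);
		return ret;
	}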

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
@ 2017-10-09  2:20     ` kemi
  0 siblings, 0 replies; 48+ messages in thread
From: kemi @ 2017-10-09  2:20 UTC (permalink / raw)
  To: Vlastimil Babka, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Michal Hocko, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior
  Cc: Dave, Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel



On 2017年09月29日 15:03, Vlastimil Babka wrote:
> On 09/28/2017 08:11 AM, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =========================================================================
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 10000000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =========================================================================
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =========================================================================
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
>>
>> Branch target selection with jump label:
>> a) When numa_stats_mode is changed to *strict*, jump to the branch for numa
>> counters update.
>> b) When numa_stats_mode is changed to *coarse*, return back directly.
>> c) When numa_stats_mode is changed to *auto*, the branch target used in
>> last time is kept, and the branch target is changed to the branch for numa
>> counters update once numa counters are *read* by users.
>>
>> Therefore, with the help of jump label, the page allocation performance is
>> hardly affected when numa counters are updated with a call in
>> zone_statistics(). Meanwhile, the auto mode can give people benefit without
>> manual tuning.
>>
>> Many thanks to Michal Hocko, Dave Hansen and Ying Huang for comments to
>> help improve the original patch.
>>
>> ChangeLog:
>>   V2->V3:
>>   a) Propose a better way to use jump label to eliminate the overhead of
>>   branch selection in zone_statistics(), as inspired by Ying Huang;
>>   b) Add a paragraph in commit log to describe the way for branch target
>>   selection;
>>   c) Use a more descriptive name numa_stats_mode instead of vmstat_mode,
>>   and change the description accordingly, as suggested by Michal Hocko;
>>   d) Make this functionality NUMA-specific via ifdef
>>
>>   V1->V2:
>>   a) Merge to one patch;
>>   b) Use jump label to eliminate the overhead of branch selection;
>>   c) Add a single-time log message at boot time to help tell users what
>>   happened.
>>
>> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> Suggested-by: Dave Hansen <dave.hansen@intel.com>
>> Suggested-by: Ying Huang <ying.huang@intel.com>
>> Signed-off-by: Kemi Wang <kemi.wang@intel.com>
>> ---
>>  Documentation/sysctl/vm.txt |  24 +++++++++
>>  drivers/base/node.c         |   4 ++
>>  include/linux/vmstat.h      |  23 ++++++++
>>  init/main.c                 |   3 ++
>>  kernel/sysctl.c             |   7 +++
>>  mm/page_alloc.c             |  10 ++++
>>  mm/vmstat.c                 | 129 ++++++++++++++++++++++++++++++++++++++++++++
>>  7 files changed, 200 insertions(+)
>>
>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>> index 9baf66a..e310e69 100644
>> --- a/Documentation/sysctl/vm.txt
>> +++ b/Documentation/sysctl/vm.txt
>> @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm:
>>  - swappiness
>>  - user_reserve_kbytes
>>  - vfs_cache_pressure
>> +- numa_stats_mode
>>  - watermark_scale_factor
>>  - zone_reclaim_mode
>>  
>> @@ -843,6 +844,29 @@ ten times more freeable objects than there are.
>>  
>>  =============================================================
>>  
>> +numa_stats_mode
>> +
>> +This interface allows numa statistics configurable.
>> +
>> +When page allocation performance becomes a bottleneck and you can tolerate
>> +some possible tool breakage and decreased numa counter precision, you can
>> +do:
>> +	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> +
>> +When page allocation performance is not a bottleneck and you want all
>> +tooling to work, you can do:
>> +	echo [S|s]trict > /proc/sys/vm/numa_stat_mode
>> +
>> +We recommend automatic detection of numa statistics by system, because numa
>> +statistics does not affect system's decision and it is very rarely
>> +consumed. you can do:
>> +	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> +This is also system default configuration, with this default setting, numa
>> +counters update is skipped unless the counter is *read* by users at least
>> +once.
> 
> It says "the counter", but it seems multiple files in /proc and /sys are
> triggering this, so perhaps list them?

Exactly, four files use it.
/proc/zoneinfo
/proc/vmstat
/sys/devices/system/node/node*/vmstat
/sys/devices/system/node/node*/numastat
Well, I am not sure that it is worthy to list here.

> Also, is it possible that with contemporary userspace/distros (systemd
> etc.) there will always be something that will read one of those upon boot?
> 
It depends on the tools used in userspace. If some tool really read it,
the active state in auto mode will be triggered.

>> +
>> +==============================================================
>> +
>>  watermark_scale_factor:
>>  
>>  This factor controls the aggressiveness of kswapd. It defines the
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 3855902..b57b5622 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>>  static ssize_t node_read_numastat(struct device *dev,
>>  				struct device_attribute *attr, char *buf)
>>  {
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>>  	return sprintf(buf,
>>  		       "numa_hit %lu\n"
>>  		       "numa_miss %lu\n"
>> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  		n += sprintf(buf+n, "%s %lu\n",
>>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>  			     sum_zone_numa_state(nid, i));
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>>  #endif
>>  
>>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
>> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> index ade7cb5..d52e882 100644
>> --- a/include/linux/vmstat.h
>> +++ b/include/linux/vmstat.h
>> @@ -6,9 +6,28 @@
>>  #include <linux/mmzone.h>
>>  #include <linux/vm_event_item.h>
>>  #include <linux/atomic.h>
>> +#include <linux/static_key.h>
>>  
>>  extern int sysctl_stat_interval;
>>  
>> +#ifdef CONFIG_NUMA
>> +DECLARE_STATIC_KEY_FALSE(vm_numa_stats_mode_key);
>> +/*
>> + * vm_numa_stats_mode:
>> + * 0 = auto mode of NUMA stats, automatic detection of NUMA statistics.
>> + * 1 = strict mode of NUMA stats, keep NUMA statistics.
>> + * 2 = coarse mode of NUMA stats, ignore NUMA statistics.
>> + */
>> +#define VM_NUMA_STAT_AUTO_MODE 0
>> +#define VM_NUMA_STAT_STRICT_MODE  1
>> +#define VM_NUMA_STAT_COARSE_MODE  2
>> +#define VM_NUMA_STAT_MODE_LEN 16
>> +extern int vm_numa_stats_mode;
>> +extern char sysctl_vm_numa_stats_mode[];
>> +extern int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
>> +		void __user *buffer, size_t *length, loff_t *ppos);
>> +#endif
>> +
>>  #ifdef CONFIG_VM_EVENT_COUNTERS
>>  /*
>>   * Light weight per cpu counter implementation.
>> @@ -229,6 +248,10 @@ extern unsigned long sum_zone_node_page_state(int node,
>>  extern unsigned long sum_zone_numa_state(int node, enum numa_stat_item item);
>>  extern unsigned long node_page_state(struct pglist_data *pgdat,
>>  						enum node_stat_item item);
>> +extern void zero_zone_numa_counters(struct zone *zone);
>> +extern void zero_zones_numa_counters(void);
>> +extern void zero_global_numa_counters(void);
>> +extern void invalid_numa_statistics(void);
> 
> These seem to be called only from within mm/vmstat.c where they live, so
> I'd suggest removing these extern declarations, and making them static
> in vmstat.c.
> 

Agree. Thanks for catching it.

> ...
> 
>>  #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
>>  
>> +#ifdef CONFIG_NUMA
>> +int vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
>> +char sysctl_vm_numa_stats_mode[VM_NUMA_STAT_MODE_LEN] = "auto";
>> +static const char *vm_numa_stats_mode_name[3] = {"auto", "strict", "coarse"};
>> +static DEFINE_MUTEX(vm_numa_stats_mode_lock);
>> +
>> +static int __parse_vm_numa_stats_mode(char *s)
>> +{
>> +	const char *str = s;
>> +
>> +	if (strcmp(str, "auto") == 0 || strcmp(str, "Auto") == 0)
>> +		vm_numa_stats_mode = VM_NUMA_STAT_AUTO_MODE;
>> +	else if (strcmp(str, "strict") == 0 || strcmp(str, "Strict") == 0)
>> +		vm_numa_stats_mode = VM_NUMA_STAT_STRICT_MODE;
>> +	else if (strcmp(str, "coarse") == 0 || strcmp(str, "Coarse") == 0)
>> +		vm_numa_stats_mode = VM_NUMA_STAT_COARSE_MODE;
>> +	else {
>> +		pr_warn("Ignoring invalid vm_numa_stats_mode value: %s\n", s);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +int sysctl_vm_numa_stats_mode_handler(struct ctl_table *table, int write,
>> +		void __user *buffer, size_t *length, loff_t *ppos)
>> +{
>> +	char old_string[VM_NUMA_STAT_MODE_LEN];
>> +	int ret, oldval;
>> +
>> +	mutex_lock(&vm_numa_stats_mode_lock);
>> +	if (write)
>> +		strncpy(old_string, (char *)table->data, VM_NUMA_STAT_MODE_LEN);
>> +	ret = proc_dostring(table, write, buffer, length, ppos);
>> +	if (ret || !write) {
>> +		mutex_unlock(&vm_numa_stats_mode_lock);
>> +		return ret;
>> +	}
>> +
>> +	oldval = vm_numa_stats_mode;
>> +	if (__parse_vm_numa_stats_mode((char *)table->data)) {
>> +		/*
>> +		 * invalid sysctl_vm_numa_stats_mode value, restore saved string
>> +		 */
>> +		strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
>> +		vm_numa_stats_mode = oldval;
> 
> Do we need to restore vm_numa_stats_mode? 

Not necessary. 

> AFAICS it didn't change. Also,
> should the EINVAL be returned also to userspace? (not sure what's the
> API here, hmm man 2 sysctl doesn't list EINVAL...)
> 

I don't think so. __parse is only called from the sysctl handler, and its
error return is only used there as a signal to restore the saved string.
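
(If we did want to report the error to the writer, the tail of the handler
could look roughly like the sketch below -- illustrative only, reusing the
names already in the patch:)

        ret = __parse_vm_numa_stats_mode((char *)table->data);
        if (ret) {
                /* invalid value: put the previous string back, return -EINVAL */
                strncpy((char *)table->data, old_string, VM_NUMA_STAT_MODE_LEN);
                mutex_unlock(&vm_numa_stats_mode_lock);
                return ret;
        }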


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-03  9:23   ` Michal Hocko
  (?)
@ 2017-10-09  6:34     ` kemi
  -1 siblings, 0 replies; 48+ messages in thread
From: kemi @ 2017-10-09  6:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Luis R . Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel



On 2017年10月03日 17:23, Michal Hocko wrote:
> On Thu 28-09-17 14:11:41, Kemi Wang wrote:
>> This is the second step which introduces a tunable interface that allow
>> numa stats configurable for optimizing zone_statistics(), as suggested by
>> Dave Hansen and Ying Huang.
>>
>> =========================================================================
>> When page allocation performance becomes a bottleneck and you can tolerate
>> some possible tool breakage and decreased numa counter precision, you can
>> do:
>> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is ignored. We can see about
>> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
>> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
>> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
>> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
>> memory).
>>
>> Benchmark link provided by Jesper D Brouer(increase loop times to
>> 10000000):
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
>> bench
>>
>> =========================================================================
>> When page allocation performance is not a bottleneck and you want all
>> tooling to work, you can do:
>> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
>>
>> =========================================================================
>> We recommend automatic detection of numa statistics by system, this is also
>> system default configuration, you can do:
>> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
>> In this case, numa counter update is skipped unless it has been read by
>> users at least once, e.g. cat /proc/zoneinfo.
> 
> I am still not convinced the auto mode is worth all the additional code
> and a safe default to use. The whole thing could have been 0/1 with a
> simpler parsing and less code to catch readers.
> 

I understand your concern.
Well, we may get rid of the auto mode if it turns out to have an obvious
disadvantage. For now, I tend to keep it, because most people will never
touch this interface, and the auto mode helps exactly in that case.

> E.g. why do we have to do static_branch_enable on any read or even
> vmstat_stop? Wouldn't open be sufficient?
> 

NUMA stats are exposed in four files:
/proc/zoneinfo
/proc/vmstat
/sys/devices/system/node/node*/numastat
/sys/devices/system/node/node*/vmstat
In auto mode, any *read* of these files is supposed to turn NUMA counter
updates back on. So we have to make sure the branch target is switched to
the one that updates the NUMA counters as soon as one of these files is
read from user space. The static_branch_enable() in vmstat_stop() sits on
the read path of /proc/vmstat (it is called from file->f_op->read via the
seq_file code) and covers that file.

I guess *open* here means file->f_op->open, right?
Do you suggest moving static_branch_enable() to the open callback? Thanks.
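
If so, the /proc/vmstat side could look roughly like the sketch below
(illustrative only; it assumes the existing vmstat_open() is a thin wrapper
around seq_open(). The two sysfs node attributes have no open method of
their own, so they would keep the check in their show functions):

static int vmstat_open(struct inode *inode, struct file *file)
{
#ifdef CONFIG_NUMA
        /* in auto mode, the first reader of /proc/vmstat enables NUMA stats */
        if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
                static_branch_enable(&vm_numa_stats_mode_key);
#endif
        return seq_open(file, &vmstat_op);
}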

>> @@ -153,6 +153,8 @@ static DEVICE_ATTR(meminfo, S_IRUGO, node_read_meminfo, NULL);
>>  static ssize_t node_read_numastat(struct device *dev,
>>  				struct device_attribute *attr, char *buf)
>>  {
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>>  	return sprintf(buf,
>>  		       "numa_hit %lu\n"
>>  		       "numa_miss %lu\n"
>> @@ -186,6 +188,8 @@ static ssize_t node_read_vmstat(struct device *dev,
>>  		n += sprintf(buf+n, "%s %lu\n",
>>  			     vmstat_text[i + NR_VM_ZONE_STAT_ITEMS],
>>  			     sum_zone_numa_state(nid, i));
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>>  #endif
>>  
>>  	for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
> [...]
>> @@ -1582,6 +1703,10 @@ static int zoneinfo_show(struct seq_file *m, void *arg)
>>  {
>>  	pg_data_t *pgdat = (pg_data_t *)arg;
>>  	walk_zones_in_node(m, pgdat, false, false, zoneinfo_show_print);
>> +#ifdef CONFIG_NUMA
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>> +#endif
>>  	return 0;
>>  }
>>  
>> @@ -1678,6 +1803,10 @@ static int vmstat_show(struct seq_file *m, void *arg)
>>  
>>  static void vmstat_stop(struct seq_file *m, void *arg)
>>  {
>> +#ifdef CONFIG_NUMA
>> +	if (vm_numa_stats_mode == VM_NUMA_STAT_AUTO_MODE)
>> +		static_branch_enable(&vm_numa_stats_mode_key);
>> +#endif
>>  	kfree(m->private);
>>  	m->private = NULL;
>>  }
>> -- 
>> 2.7.4
>>
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-09  6:34     ` kemi
  (?)
@ 2017-10-09  7:55       ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-10-09  7:55 UTC (permalink / raw)
  To: kemi
  Cc: Luis R . Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel

On Mon 09-10-17 14:34:11, kemi wrote:
> 
> 
> On 2017年10月03日 17:23, Michal Hocko wrote:
> > On Thu 28-09-17 14:11:41, Kemi Wang wrote:
> >> This is the second step which introduces a tunable interface that allow
> >> numa stats configurable for optimizing zone_statistics(), as suggested by
> >> Dave Hansen and Ying Huang.
> >>
> >> =========================================================================
> >> When page allocation performance becomes a bottleneck and you can tolerate
> >> some possible tool breakage and decreased numa counter precision, you can
> >> do:
> >> 	echo [C|c]oarse > /proc/sys/vm/numa_stats_mode
> >> In this case, numa counter update is ignored. We can see about
> >> *4.8%*(185->176) drop of cpu cycles per single page allocation and reclaim
> >> on Jesper's page_bench01 (single thread) and *8.1%*(343->315) drop of cpu
> >> cycles per single page allocation and reclaim on Jesper's page_bench03 (88
> >> threads) running on a 2-Socket Broadwell-based server (88 threads, 126G
> >> memory).
> >>
> >> Benchmark link provided by Jesper D Brouer(increase loop times to
> >> 10000000):
> >> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
> >> bench
> >>
> >> =========================================================================
> >> When page allocation performance is not a bottleneck and you want all
> >> tooling to work, you can do:
> >> 	echo [S|s]trict > /proc/sys/vm/numa_stats_mode
> >>
> >> =========================================================================
> >> We recommend automatic detection of numa statistics by system, this is also
> >> system default configuration, you can do:
> >> 	echo [A|a]uto > /proc/sys/vm/numa_stats_mode
> >> In this case, numa counter update is skipped unless it has been read by
> >> users at least once, e.g. cat /proc/zoneinfo.
> > 
> > I am still not convinced the auto mode is worth all the additional code
> > and a safe default to use. The whole thing could have been 0/1 with a
> > simpler parsing and less code to catch readers.
> > 
> 
> I understood your concern. 
> Well, we may get rid of auto mode if there is some obvious disadvantage
> here. Now, I tend to keep it because most people may not touch this interface,
> and auto mode is helpful in such case.

But you cannot guarantee it won't break any existing users, can you?
Besides, I do not remember anybody complaining about the performance
impact of these counters other than very specialized workloads which are
going to disable the accounting altogether. So I simply fail to see a
reason to add more code with questionable semantics (see below on
partial reads).

> > E.g. why do we have to do static_branch_enable on any read or even
> > vmstat_stop? Wouldn't open be sufficient?
> > 
> 
> NUMA stats is used in four files:
> /proc/zoneinfo
> /proc/vmstat
> /sys/devices/system/node/node*/numastat
> /sys/devices/system/node/node*/vmstat
> In auto mode, each *read* will trigger the update of NUMA counter. 
> So, we should make sure the target branch is jumped to the branch 
> for NUMA counter update once the file is read from user space.
> the intension of static_branch_enable in vmstat_stop(in the call site 
> of file->file_ops.read) is for reading /proc/vmstat in case.  
> 
> I guess the *open* means file->file_op.open here, right?
> Do you suggest to move static_branch_enable to file->file_op.open? Thanks.

I haven't checked closely but what happens (or should happen) when you
do a partial read? Should you get an inconsistent results? Or is this
impossible?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-09  7:55       ` Michal Hocko
@ 2017-10-10  5:49         ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-10-10  5:49 UTC (permalink / raw)
  To: kemi
  Cc: Luis R . Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel

On Mon 09-10-17 09:55:49, Michal Hocko wrote:
> I haven't checked closely but what happens (or should happen) when you
> do a partial read? Should you get an inconsistent results? Or is this
> impossible?

Well, after thinking about it a little bit more, partial reads are always
inconsistent, so this wouldn't add a new problem.

Anyway I still stand by my position that this sounds over-engineered and
a simple 0/1 resp. on/off interface would be both simpler and safer. If
anybody wants an auto mode it can be added later (as a value 2 resp.
auto).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10  5:49         ` Michal Hocko
  (?)
@ 2017-10-10  5:54           ` kemi
  -1 siblings, 0 replies; 48+ messages in thread
From: kemi @ 2017-10-10  5:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Luis R . Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Dave, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel



On 2017年10月10日 13:49, Michal Hocko wrote:
> On Mon 09-10-17 09:55:49, Michal Hocko wrote:
>> I haven't checked closely but what happens (or should happen) when you
>> do a partial read? Should you get an inconsistent results? Or is this
>> impossible?
> 
> Well, after thinking about it little bit more, partial reads are always
> inconsistent so this wouldn't add a new problem.
> 
> Anyway I still stand by my position that this sounds over-engineered and
> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> anybody wants an auto mode it can be added later (as a value 2 resp.
> auto).
> 

It sounds good to me. If Andrew also prefers a simple 0/1 interface, I will
submit a V4 patch for it. Thanks
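
For reference, the 0/1 variant could end up about as small as the sketch
below (illustrative only, not the actual v4 patch: the sysctl name, the
variable and the handler name are placeholders, only the static key is
reused from v3):

/* kernel/sysctl.c, new entry in vm_table[] */
{
        .procname       = "numa_stat",
        .data           = &sysctl_vm_numa_stat,
        .maxlen         = sizeof(int),
        .mode           = 0644,
        .proc_handler   = sysctl_vm_numa_stat_handler,
        .extra1         = &zero,
        .extra2         = &one,
},

/* mm/vmstat.c */
int sysctl_vm_numa_stat = 1;    /* default: current (strict) behaviour */

int sysctl_vm_numa_stat_handler(struct ctl_table *table, int write,
                void __user *buffer, size_t *length, loff_t *ppos)
{
        int ret, oldval = sysctl_vm_numa_stat;

        ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
        if (ret || !write || sysctl_vm_numa_stat == oldval)
                return ret;

        if (sysctl_vm_numa_stat)
                static_branch_enable(&vm_numa_stats_mode_key);
        else
                static_branch_disable(&vm_numa_stats_mode_key);
        return 0;
}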

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10  5:49         ` Michal Hocko
@ 2017-10-10 14:29           ` Dave Hansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2017-10-10 14:29 UTC (permalink / raw)
  To: Michal Hocko, kemi
  Cc: Luis R . Rodriguez, Kees Cook, Andrew Morton, Jonathan Corbet,
	Mel Gorman, Johannes Weiner, Christopher Lameter,
	Sebastian Andrzej Siewior, Vlastimil Babka, Tim Chen, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Proc sysctl,
	Linux MM, Linux Kernel

On 10/09/2017 10:49 PM, Michal Hocko wrote:
> On Mon 09-10-17 09:55:49, Michal Hocko wrote:
>> I haven't checked closely but what happens (or should happen) when you
>> do a partial read? Should you get an inconsistent results? Or is this
>> impossible?
> Well, after thinking about it little bit more, partial reads are always
> inconsistent so this wouldn't add a new problem.
> 
> Anyway I still stand by my position that this sounds over-engineered and
> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> anybody wants an auto mode it can be added later (as a value 2 resp.
> auto).

0/1 with the default set to the strict, slower mode?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 14:29           ` Dave Hansen
@ 2017-10-10 14:31             ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-10-10 14:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kemi, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior, Vlastimil Babka,
	Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel

On Tue 10-10-17 07:29:31, Dave Hansen wrote:
> On 10/09/2017 10:49 PM, Michal Hocko wrote:
> > On Mon 09-10-17 09:55:49, Michal Hocko wrote:
> >> I haven't checked closely but what happens (or should happen) when you
> >> do a partial read? Should you get an inconsistent results? Or is this
> >> impossible?
> > Well, after thinking about it little bit more, partial reads are always
> > inconsistent so this wouldn't add a new problem.
> > 
> > Anyway I still stand by my position that this sounds over-engineered and
> > a simple 0/1 resp. on/off interface would be both simpler and safer. If
> > anybody wants an auto mode it can be added later (as a value 2 resp.
> > auto).
> 
> 0/1 with the default set to the strict, slower mode?

yes, keep the current semantics and allow users who care to disable
something that stands in the way.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 14:31             ` Michal Hocko
@ 2017-10-10 14:53               ` Dave Hansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2017-10-10 14:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kemi, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior, Vlastimil Babka,
	Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel

On 10/10/2017 07:31 AM, Michal Hocko wrote:
> On Tue 10-10-17 07:29:31, Dave Hansen wrote:
>> On 10/09/2017 10:49 PM, Michal Hocko wrote:
>>> Anyway I still stand by my position that this sounds over-engineered and
>>> a simple 0/1 resp. on/off interface would be both simpler and safer. If
>>> anybody wants an auto mode it can be added later (as a value 2 resp.
>>> auto).
>>
>> 0/1 with the default set to the strict, slower mode?
> 
> yes, keep the current semantic and allow users who care to disable
> something that stands in the way.

But, let's be honest, this leaves us with an option that nobody is ever
going to turn on.  IOW, nobody except a very small portion of our users
will ever see any benefit from this.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 14:53               ` Dave Hansen
@ 2017-10-10 14:57                 ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-10-10 14:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kemi, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior, Vlastimil Babka,
	Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel

On Tue 10-10-17 07:53:50, Dave Hansen wrote:
> On 10/10/2017 07:31 AM, Michal Hocko wrote:
> > On Tue 10-10-17 07:29:31, Dave Hansen wrote:
> >> On 10/09/2017 10:49 PM, Michal Hocko wrote:
> >>> Anyway I still stand by my position that this sounds over-engineered and
> >>> a simple 0/1 resp. on/off interface would be both simpler and safer. If
> >>> anybody wants an auto mode it can be added later (as a value 2 resp.
> >>> auto).
> >>
> >> 0/1 with the default set to the strict, slower mode?
> > 
> > yes, keep the current semantic and allow users who care to disable
> > something that stands in the way.
> 
> But, let's be honest, this leaves us with an option that nobody is ever
> going to turn on.  IOW, nobody except a very small portion of our users
> will ever see any benefit from this.

But aren't those small groups who would like to squeeze every single
cycle out from the page allocator path the targeted audience?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 14:57                 ` Michal Hocko
  (?)
@ 2017-10-10 15:14                 ` Christopher Lameter
  -1 siblings, 0 replies; 48+ messages in thread
From: Christopher Lameter @ 2017-10-10 15:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, kemi, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Sebastian Andrzej Siewior, Vlastimil Babka, Tim Chen, Andi Kleen,
	Jesper Dangaard Brouer, Ying Huang, Aaron Lu, Proc sysctl,
	Linux MM, Linux Kernel

On Tue, 10 Oct 2017, Michal Hocko wrote:

> > But, let's be honest, this leaves us with an option that nobody is ever
> > going to turn on.  IOW, nobody except a very small portion of our users
> > will ever see any benefit from this.
>
> But aren't those small groups who would like to squeeze every single
> cycle out from the page allocator path the targeted audience?

Those have long since raised the white flag and succumbed to the
featuritis. They are resigned to trying to keep the bloat restricted to a
couple of cores so that the rest of the cores stay usable.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 14:57                 ` Michal Hocko
  (?)
@ 2017-10-10 15:39                   ` Dave Hansen
  -1 siblings, 0 replies; 48+ messages in thread
From: Dave Hansen @ 2017-10-10 15:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kemi, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior, Vlastimil Babka,
	Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel, Aaron Lu

On 10/10/2017 07:57 AM, Michal Hocko wrote:
>> But, let's be honest, this leaves us with an option that nobody is ever
>> going to turn on.  IOW, nobody except a very small portion of our users
>> will ever see any benefit from this.
> But aren't those small groups who would like to squeeze every single
> cycle out from the page allocator path the targeted audience?

They're the reason we started looking at this.  They also care the most.

But the cost of these stats, especially as we get more and more cores in
a NUMA node, is really making them show up in profiles.  It would be nice
to get rid of them there, too.

Aaron, do you remember offhand how much of the allocator overhead was
coming from NUMA stats?

^ permalink raw reply	[flat|nested] 48+ messages in thread
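
For context on what that per-allocation cost consists of, below is a
condensed, illustrative sketch of the NUMA statistics bookkeeping done for
every successful allocation. It follows the general shape of
zone_statistics(), but it is not the exact upstream code and the helper
spellings vary between kernel versions:

#include <linux/mmzone.h>
#include <linux/topology.h>
#include <linux/vmstat.h>

/*
 * Roughly what is charged on every allocation when NUMA stats are on:
 * a hit/miss/foreign classification against the preferred zone plus a
 * local/other classification against the allocating CPU's node.
 */
static void numa_stats_sketch(struct zone *preferred_zone, struct zone *z)
{
	if (zone_to_nid(z) == zone_to_nid(preferred_zone)) {
		__inc_zone_state(z, NUMA_HIT);
	} else {
		__inc_zone_state(z, NUMA_MISS);
		__inc_zone_state(preferred_zone, NUMA_FOREIGN);
	}

	if (zone_to_nid(z) == numa_node_id())
		__inc_zone_state(z, NUMA_LOCAL);
	else
		__inc_zone_state(z, NUMA_OTHER);
}

It is this handful of counter updates, repeated for every allocation on
every core of a node, that shows up in the profiles being discussed.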

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 15:39                   ` Dave Hansen
@ 2017-10-10 17:51                     ` Michal Hocko
  -1 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2017-10-10 17:51 UTC (permalink / raw)
  To: Dave Hansen
  Cc: kemi, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior, Vlastimil Babka,
	Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel

On Tue 10-10-17 08:39:47, Dave Hansen wrote:
> On 10/10/2017 07:57 AM, Michal Hocko wrote:
> >> But, let's be honest, this leaves us with an option that nobody is ever
> >> going to turn on.  IOW, nobody except a very small portion of our users
> >> will ever see any benefit from this.
> > But aren't those small groups who would like to squeeze every single
> > cycle out from the page allocator path the targeted audience?
> 
> They're the reason we started looking at this.  They also care the most.
> 
> But the cost of these stats, especially as we get more and more cores
> in a NUMA node, is really making them show up in profiles.  It would be
> nice to get rid of them there, too.

I am not opposed to the auto mode. I am just not sure it is a safe
default, and I also think that we should add it on top if it is really
needed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread
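
Whichever default wins, a common way to make such a runtime toggle
essentially free in the allocation path is a static key, so that the
disabled case costs only a patched-out branch. A hedged sketch with
assumed names follows; zone_statistics() stands in for the existing
counter-update path, and none of this is the implementation that was
merged:

#include <linux/jump_label.h>

/* Stats enabled by default, matching the "safe default" position above. */
static DEFINE_STATIC_KEY_TRUE(numa_stats_key);

static inline void numa_stats_account(struct zone *preferred_zone, struct zone *z)
{
	/* Compiled to a patchable branch; becomes a NOP once the key is disabled. */
	if (static_branch_likely(&numa_stats_key))
		zone_statistics(preferred_zone, z);
}

/* Called from the sysctl handler when the mode is changed. */
static void numa_stats_set(bool enabled)
{
	if (enabled)
		static_branch_enable(&numa_stats_key);
	else
		static_branch_disable(&numa_stats_key);
}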

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 17:51                     ` Michal Hocko
@ 2017-10-10 21:34                       ` Andrew Morton
  -1 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2017-10-10 21:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, kemi, Luis R . Rodriguez, Kees Cook,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior, Vlastimil Babka,
	Tim Chen, Andi Kleen, Jesper Dangaard Brouer, Ying Huang,
	Aaron Lu, Proc sysctl, Linux MM, Linux Kernel

On Tue, 10 Oct 2017 19:51:43 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Tue 10-10-17 08:39:47, Dave Hansen wrote:
> > On 10/10/2017 07:57 AM, Michal Hocko wrote:
> > >> But, let's be honest, this leaves us with an option that nobody is ever
> > >> going to turn on.  IOW, nobody except a very small portion of our users
> > >> will ever see any benefit from this.
> > > But aren't those small groups who would like to squeeze every single
> > > cycle out from the page allocator path the targeted audience?
> > 
> > They're the reason we started looking at this.  They also care the most.
> > 
> > But the cost of these stats, especially as we get more and more cores
> > in a NUMA node, is really making them show up in profiles.  It would
> > be nice to get rid of them there, too.
> 
> I am not opposed to the auto mode. I am just not sure it is a safe
> default, and I also think that we should add it on top if it is really
> needed.

Yup.  Let's keep things simple unless a real need is demonstrated, please.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v3] mm, sysctl: make NUMA stats configurable
  2017-10-10 15:39                   ` Dave Hansen
@ 2017-10-11  6:16                     ` Vlastimil Babka
  -1 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2017-10-11  6:16 UTC (permalink / raw)
  To: Dave Hansen, Michal Hocko
  Cc: kemi, Luis R . Rodriguez, Kees Cook, Andrew Morton,
	Jonathan Corbet, Mel Gorman, Johannes Weiner,
	Christopher Lameter, Sebastian Andrzej Siewior, Tim Chen,
	Andi Kleen, Jesper Dangaard Brouer, Ying Huang, Aaron Lu,
	Proc sysctl, Linux MM, Linux Kernel

On 10/10/2017 05:39 PM, Dave Hansen wrote:
> On 10/10/2017 07:57 AM, Michal Hocko wrote:
>>> But, let's be honest, this leaves us with an option that nobody is ever
>>> going to turn on.  IOW, nobody except a very small portion of our users
>>> will ever see any benefit from this.
>> But aren't those small groups who would like to squeeze every single
>> cycle out from the page allocator path the targeted audience?
> 
> They're the reason we started looking at this.  They also care the most.
> 
> But the cost of these stats, especially as we get more and more cores
> in a NUMA node, is really making them show up in profiles.  It would be
> nice to get rid of them there, too.

Furthermore, the group that actually looks at those stats can also be
expected to be quite small. The group that neither cares about the stats
nor relies on top allocator performance might still arguably benefit from
improved allocator performance, but will certainly not benefit from the
stats.

> Aaron, do you remember offhand how much of the allocator overhead was
> coming from NUMA stats?
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2017-10-11  6:16 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-28  6:11 [PATCH v3] mm, sysctl: make NUMA stats configurable Kemi Wang
2017-09-28  6:11 ` Kemi Wang
2017-09-28 21:29 ` Andrew Morton
2017-09-28 21:29   ` Andrew Morton
2017-09-29  1:44   ` kemi
2017-09-29  1:44     ` kemi
2017-09-29  1:44     ` kemi
2017-09-29  7:09   ` Vlastimil Babka
2017-09-29  7:09     ` Vlastimil Babka
2017-09-29  7:03 ` Vlastimil Babka
2017-09-29  7:03   ` Vlastimil Babka
2017-10-09  2:20   ` kemi
2017-10-09  2:20     ` kemi
2017-10-09  2:20     ` kemi
2017-09-29  7:27 ` Vlastimil Babka
2017-09-29  7:27   ` Vlastimil Babka
2017-09-29  7:27   ` Vlastimil Babka
2017-10-03  9:23 ` Michal Hocko
2017-10-03  9:23   ` Michal Hocko
2017-10-09  6:34   ` kemi
2017-10-09  6:34     ` kemi
2017-10-09  6:34     ` kemi
2017-10-09  7:55     ` Michal Hocko
2017-10-09  7:55       ` Michal Hocko
2017-10-09  7:55       ` Michal Hocko
2017-10-10  5:49       ` Michal Hocko
2017-10-10  5:49         ` Michal Hocko
2017-10-10  5:54         ` kemi
2017-10-10  5:54           ` kemi
2017-10-10  5:54           ` kemi
2017-10-10 14:29         ` Dave Hansen
2017-10-10 14:29           ` Dave Hansen
2017-10-10 14:31           ` Michal Hocko
2017-10-10 14:31             ` Michal Hocko
2017-10-10 14:53             ` Dave Hansen
2017-10-10 14:53               ` Dave Hansen
2017-10-10 14:57               ` Michal Hocko
2017-10-10 14:57                 ` Michal Hocko
2017-10-10 15:14                 ` Christopher Lameter
2017-10-10 15:39                 ` Dave Hansen
2017-10-10 15:39                   ` Dave Hansen
2017-10-10 15:39                   ` Dave Hansen
2017-10-10 17:51                   ` Michal Hocko
2017-10-10 17:51                     ` Michal Hocko
2017-10-10 21:34                     ` Andrew Morton
2017-10-10 21:34                       ` Andrew Morton
2017-10-11  6:16                   ` Vlastimil Babka
2017-10-11  6:16                     ` Vlastimil Babka
