Linux-Fsdevel Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/3] sched/numa: introduce advanced numa statistic
@ 2019-11-13  3:43 王贇
  2019-11-13  3:44 ` [PATCH 1/3] sched/numa: advanced per-cgroup " 王贇
                   ` (5 more replies)
  0 siblings, 6 replies; 45+ messages in thread
From: 王贇 @ 2019-11-13  3:43 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Modern production environment could use hundreds of cgroup to control
the resources for different workloads, along with the complicated
resource binding.

On NUMA platforms where we have multiple nodes, things become even more
complicated, we hope there are more local memory access to improve the
performance, and NUMA Balancing keep working hard to achieve that,
however, wrong memory policy or node binding could easily waste the
effort, result a lot of remote page accessing.

We need to perceive such problems, then we got chance to fix it before
there are too much damages, however, there are no good approach yet to
help catch the mouse who introduced the remote access.

This patch set is trying to fill in the missing pieces, by introduce
the per-cgroup NUMA locality/exectime statistics, and expose the per-task
page migration failure counter, with these statistics, we could achieve
the daily monitoring on NUMA efficiency, to give warning when things going
too wrong.

Please check the third patch for more details.

Thanks to Peter, Mel and Michal for the good advices :-)

Michael Wang (3):
  sched/numa: advanced per-cgroup numa statistic
  sched/numa: expose per-task pages-migration-failure counter
  sched/numa: documentation for per-cgroup numa stat

 Documentation/admin-guide/cg-numa-stat.rst      | 161 ++++++++++++++++++++++++
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 include/linux/sched.h                           |  18 ++-
 include/linux/sched/sysctl.h                    |   6 +
 init/Kconfig                                    |   9 ++
 kernel/sched/core.c                             |  91 ++++++++++++++
 kernel/sched/debug.c                            |   1 +
 kernel/sched/fair.c                             |  33 +++++
 kernel/sched/sched.h                            |  17 +++
 kernel/sysctl.c                                 |  11 ++
 11 files changed, 359 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst


-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
@ 2019-11-13  3:44 ` " 王贇
  2019-11-13  3:45 ` [PATCH 2/3] sched/numa: expose per-task pages-migration-failure 王贇
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-13  3:44 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Currently there are no good approach to monitoring the per-cgroup
numa efficiency, this could be a trouble especially when groups
are sharing CPUs, it's impossible to tell which one caused the
remote-memory access by reading hardware counter since multiple
workloads could sharing the same CPU, which make it painful when
one want to find out the root cause and fix the issue.

In order to address this, we introduced new per-cgroup statistic
for numa:
  * the numa locality to imply the numa balancing efficiency
  * the numa execution time on each node

The task locality is the local page accessing ratio traced on numa
balancing PF, and the group locality is the topology of task execution
time, sectioned by the locality into 7 regions.

For example the new entry 'cpu.numa_stat' show:
  locality 39541 60962 36842 72519 118605 721778 946553
  exectime 1220127 1458684

Here we know the workloads in hierarchy executed 1220127ms on node_0
and 1458684ms on node_1 in total, tasks with locality around 0~13%
executed for 39541 ms, and tasks with locality around 86~100% executed
for 946553 ms, which imply most of the memory access are local access.

By monitoring the new statistic, we will be able to know the numa
efficiency of each per-cgroup workloads on machine, whatever they
sharing the CPUs or not, we will be able to find out which one
introduced the remote access mostly.

Besides, per-node memory topology from 'memory.numa_stat' become
more useful when we have the per-node execution time, workloads
always executing on node_0 while it's memory is all on node_1 is
usually a bad case.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/sched.h        | 18 ++++++++-
 include/linux/sched/sysctl.h |  6 +++
 init/Kconfig                 |  9 +++++
 kernel/sched/core.c          | 91 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 33 ++++++++++++++++
 kernel/sched/sched.h         | 17 +++++++++
 kernel/sysctl.c              | 11 ++++++
 7 files changed, 184 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 263cf089d1b3..e3daadc5a799 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1113,9 +1113,25 @@ struct task_struct {
 	 * numa_faults_locality tracks if faults recorded during the last
 	 * scan window were remote/local or failed to migrate. The task scan
 	 * period is adapted based on the locality of the faults with different
-	 * weights depending on whether they were shared or private faults
+	 * weights depending on whether they were shared or private faults.
+	 *
+	 * Counter id stand for:
+	 * 0 -- remote faults
+	 * 1 -- local faults
+	 * 2 -- page migration failure
+	 *
+	 * Extra counters when CONFIG_CGROUP_NUMA_STAT enabled:
+	 * 3 -- remote page accessing
+	 * 4 -- local page accessing
+	 *
+	 * The 'remote/local faults' records the cpu-page relationship before
+	 * page migration, while the 'remote/local page accessing' is after.
 	 */
+#ifndef CONFIG_CGROUP_NUMA_STAT
 	unsigned long			numa_faults_locality[3];
+#else
+	unsigned long			numa_faults_locality[5];
+#endif

 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 89f55e914673..2d6a515df544 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -102,4 +102,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
 				 loff_t *ppos);
 #endif

+#ifdef CONFIG_CGROUP_NUMA_STAT
+extern int sysctl_cg_numa_stat(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index b4daad2bac23..b0f9bfbd1c3f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.

+config CGROUP_NUMA_STAT
+	bool "Advanced per-cgroup NUMA statistics"
+	default n
+	depends on CGROUP_SCHED && NUMA_BALANCING
+	help
+	  This option adds support for per-cgroup NUMA locality/execution
+	  statistics, for monitoring NUMA efficiency of per-cgroup workloads
+	  on NUMA platforms with NUMA Balancing enabled.
+
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index eb42b71faab9..4f05576f371a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7638,6 +7638,84 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_STAT
+DEFINE_STATIC_KEY_FALSE(sched_cg_numa_stat);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_cg_numa_stat(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_cg_numa_stat);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0 || !write)
+		return err;
+
+	if (state)
+		static_branch_enable(&sched_cg_numa_stat);
+	else
+		static_branch_disable(&sched_cg_numa_stat);
+
+	return err;
+}
+#endif
+
+static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
+{
+	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
+}
+
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int nr;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	if (!static_branch_likely(&sched_cg_numa_stat))
+		return 0;
+
+	seq_puts(sf, "locality");
+	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_possible_cpu(cpu)
+			sum += tg_cfs_rq(tg, cpu)->nstat.locality[nr];
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
+	seq_puts(sf, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += tg_cfs_rq(tg, cpu)->nstat.jiffies;
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
+	return 0;
+}
+
+static __init int cg_numa_stat_setup(char *opt)
+{
+	static_branch_enable(&sched_cg_numa_stat);
+
+	return 0;
+}
+__setup("cg_numa_stat", cg_numa_stat_setup);
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7687,6 +7765,12 @@ static struct cftype cpu_legacy_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7868,6 +7952,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	{
+		.name = "numa_stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a81c36472822..706519104c9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2466,6 +2466,18 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	if (!static_branch_unlikely(&sched_cg_numa_stat))
+		return;
+
+	/*
+	 * We want to record the real local/remote page access statistic
+	 * here, so use 'mem_node' which is the real residential node of
+	 * page after migrate_misplaced_page().
+	 */
+	p->numa_faults_locality[3 + !!(mem_node == numa_node_id())] += pages;
+#endif
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -4274,6 +4286,23 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	cfs_rq->curr = NULL;
 }

+#ifdef CONFIG_CGROUP_NUMA_STAT
+static void update_numa_statistics(struct cfs_rq *cfs_rq)
+{
+	int idx;
+	unsigned long remote = current->numa_faults_locality[3];
+	unsigned long local = current->numa_faults_locality[4];
+
+	cfs_rq->nstat.jiffies++;
+
+	if (!remote && !local)
+		return;
+
+	idx = (NR_NL_INTERVAL - 1) * local / (remote + local);
+	cfs_rq->nstat.locality[idx]++;
+}
+#endif
+
 static void
 entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 {
@@ -4287,6 +4316,10 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	if (static_branch_unlikely(&sched_cg_numa_stat))
+		update_numa_statistics(cfs_rq);
+#endif

 #ifdef CONFIG_SCHED_HRTICK
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0db2c1b3361e..0476e6f95013 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -486,6 +486,16 @@ struct cfs_bandwidth { };

 #endif	/* CONFIG_CGROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_STAT
+/* NUMA Locality Interval, 7 buckets for cache align */
+#define NR_NL_INTERVAL	7
+
+struct numa_statistics {
+	u64 jiffies;
+	u64 locality[NR_NL_INTERVAL];
+};
+#endif
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
@@ -575,6 +585,9 @@ struct cfs_rq {
 	struct list_head	throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	struct numa_statistics	nstat ____cacheline_aligned;
+#endif
 };

 static inline int rt_bandwidth_enabled(void)
@@ -1601,6 +1614,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;

+#ifdef CONFIG_CGROUP_NUMA_STAT
+extern struct static_key_false sched_cg_numa_stat;
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 31ece1120aa4..63a62c4df918 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	{
+		.procname	= "cg_numa_stat",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_cg_numa_stat,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif /* CONFIG_CGROUP_NUMA_STAT */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 2/3] sched/numa: expose per-task pages-migration-failure
  2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
  2019-11-13  3:44 ` [PATCH 1/3] sched/numa: advanced per-cgroup " 王贇
@ 2019-11-13  3:45 ` 王贇
  2019-11-13  3:45 ` [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-13  3:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

NUMA balancing will try to migrate pages between nodes, which
could caused by memory policy or numa group aggregation, while
the page migration could failed too for eg when the target node
run out of memory.

Since this is critical to the performance, admin should know
how serious the problem is, and take actions before it causing
too much performance damage, thus this patch expose the counter
as 'migfailed' in '/proc/PID/sched'.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 kernel/sched/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..73c4809c8f37 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -848,6 +848,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	P(total_numa_faults);
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
 			task_node(p), task_numa_group_id(p));
+	SEQ_printf(m, "migfailed=%lu\n", p->numa_faults_locality[2]);
 	show_numa_stats(p, m);
 	mpol_put(pol);
 #endif
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
  2019-11-13  3:44 ` [PATCH 1/3] sched/numa: advanced per-cgroup " 王贇
  2019-11-13  3:45 ` [PATCH 2/3] sched/numa: expose per-task pages-migration-failure 王贇
@ 2019-11-13  3:45 ` 王贇
  2019-11-13 15:09   ` Jonathan Corbet
                     ` (2 more replies)
  2019-11-20  9:45 ` [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 45+ messages in thread
From: 王贇 @ 2019-11-13  3:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Add the description for 'cg_numa_stat', also a new doc to explain
the details on how to deal with the per-cgroup numa statistics.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 Documentation/admin-guide/cg-numa-stat.rst      | 161 ++++++++++++++++++++++++
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 3 files changed, 174 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
new file mode 100644
index 000000000000..87b716c51e16
--- /dev/null
+++ b/Documentation/admin-guide/cg-numa-stat.rst
@@ -0,0 +1,161 @@
+===============================
+Per-cgroup NUMA statistics
+===============================
+
+Background
+----------
+
+On NUMA platforms, remote memory accessing always has a performance penalty,
+although we have NUMA balancing working hard to maximum the local accessing
+proportion, there are still situations it can't helps.
+
+This could happen in modern production environment, using bunch of cgroups
+to classify and control resources which introduced complex configuration on
+memory policy, CPUs and NUMA node, NUMA balancing could facing the wrong
+memory policy or exhausted local NUMA node, lead into the low local page
+accessing proportion.
+
+We need to perceive such cases, figure out which workloads from which cgroup
+has introduced the issues, then we got chance to do adjustment to avoid
+performance damages.
+
+However, there are no hardware counter for per-task local/remote accessing
+info, we don't know how many remote page accessing has been done for a
+particular task.
+
+Statistics
+----------
+
+Fortunately, we have NUMA Balancing which scan task's mapping and trigger PF
+periodically, give us the opportunity to record per-task page accessing info.
+
+By "echo 1 > /proc/sys/kernel/cg_numa_stat" on runtime or add boot parameter
+'cg_numa_stat', we will enable the accounting of per-cgroup numa statistics,
+the 'cpu.numa_stat' entry of CPU cgroup will show statistics:
+
+  locality -- execution time sectioned by task NUMA locality (in ms)
+  exectime -- execution time sectioned by NUMA node (in ms)
+
+We define 'task NUMA locality' as:
+
+  nr_local_page_access * 100 / (nr_local_page_access + nr_remote_page_access)
+
+this per-task percentage value will be updated on the ticks for current task,
+and the access counter will be updated on task's NUMA balancing PF, so only
+the pages which NUMA Balancing paid attention to will be accounted.
+
+On each tick, we acquire the locality of current task on that CPU, accumulating
+the ticks into the counter of corresponding locality region, tasks from the
+same group sharing the counters, becoming the group locality.
+
+Similarly, we acquire the NUMA node of current CPU where the current task is
+executing on, accumulating the ticks into the counter of corresponding node,
+becoming the per-cgroup node execution time.
+
+To be noticed, the accounting is in a hierarchy way, which means the numa
+statistics representing not only the workload of this group, but also the
+workloads of all it's descendants.
+
+For example the 'cpu.numa_stat' show:
+  locality 39541 60962 36842 72519 118605 721778 946553
+  exectime 1220127 1458684
+
+The locality is sectioned into 7 regions, closely as:
+  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
+
+And exectime is sectioned into 2 nodes, 0 and 1 in this case.
+
+Thus we know the workload of this group and it's descendants have totally
+executed 1220127ms on node_0 and 1458684ms on node_1, tasks with locality
+around 0~13% executed for 39541 ms, and tasks with locality around 87~100%
+executed for 946553 ms, which imply most of the memory access are local.
+
+Monitoring
+-----------------
+
+By monitoring the increments of these statistics, we can easily know whether
+NUMA balancing is working well for a particular workload.
+
+For example we take a 5 secs sample period, and consider locality under 27%
+is bad, then on each sampling we have:
+
+  region_bad = region_1 + region_2
+  region_all = region_1 + region_2 + ... + region_7
+
+and we have the increments as:
+
+  region_bad_diff = region_bad - last_region_bad
+  region_all_diff = region_all - last_region_all
+
+which finally become:
+
+  region_bad_percent = region_bad_diff * 100 / region_all_diff
+
+we can draw a line for region_bad_percent, when the line close to 0 things
+are good, when getting close to 100% something is wrong, we can pick a proper
+watermark to trigger warning message.
+
+You may want to drop the data if the region_all is too small, which imply
+there are not much available pages for NUMA Balancing, just ignore would be
+fine since most likely the workload is insensitive to NUMA.
+
+Monitoring root group help you control the overall situation, while you may
+also want to monitoring all the leaf groups which contain the workloads, this
+help to catch the mouse.
+
+The exectime could be useful when NUMA Balancing is disabled, or when locality
+become too small, for NUMA node X we have:
+
+  exectime_X_diff = exectime_X - last_exectime_X
+  exectime_all_diff = exectime_all - last_exectime_all
+
+try put your workload into a memory cgroup which providing per-node memory
+consumption by 'memory.numa_stat' entry, then we could get:
+
+  memory_percent_X = memory_X * 100 / memory_all
+  exectime_percent_X = exectime_X_diff * 100 / exectime_all_diff
+
+These two percentage are usually matched on each node, workload should execute
+mostly on the node contain most of it's memory, but it's not guaranteed.
+
+Depends on which part of the memory accessed mostly by the workload, locality
+could still be good with just a little piece of memory locally.
+
+Thus to tell if things are find or not depends on the understanding of system
+resource deployment, however, if you find node X got 100% memory percent but 0%
+exectime percent, definitely something is wrong.
+
+Troubleshooting
+---------------
+
+After locate which workloads introduced the bad locality, check:
+
+1). Is the workloads bind into a particular NUMA node?
+2). Is there any NUMA node run out of resources?
+
+There are several ways to bind task's memory with a NUMA node, the strict way
+like the MPOL_BIND memory policy or 'cpuset.mems' will limiting the memory
+node where to allocate pages, in this situation, admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and make sure there are
+available CPU resource there.
+
+There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
+sched_setaffinity() syscall, in this situation, NUMA Balancing help to migrate
+pages into that node, admin should make sure there are available memory there.
+
+Admin could try rebind or unbind the NUMA node to erase the damage, make a
+change then observe the statistics see if things get better until the situation
+is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may found no necessary to scan pages, and
+locality could always be 0 or small number, don't pay attention to them
+since they most likely insensitive to NUMA.
+
+There are no accounting until the option turned on, so enable it in advance
+if you want to have the whole history.
+
+We have per-task migfailed counter to tell how many page migration has been
+failed for a particular task, you will find it in /proc/PID/sched entry.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5e27d74e2b74..220df1f0beb8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3191,6 +3191,10 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable

+	cg_numa_atat	[KNL] Enable advanced per-cgroup numa statistics.
+			Useful to debug NUMA efficiency problems when there are
+			lot's of per-cgroup workloads.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 614179dc79a9..719593e8be20 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -572,6 +572,15 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.

+cg_numa_stat:
+=============
+
+Enables/disables advanced per-cgroup NUMA statistic.
+
+0: disabled (default).
+1: enabled.
+
+Check Documentation/admin-guide/cg-numa-stat.rst for details.

 osrelease, ostype & version:
 ============================
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-13  3:45 ` [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
@ 2019-11-13 15:09   ` Jonathan Corbet
  2019-11-14  1:52     ` 王贇
  2019-11-13 18:28   ` Iurii Zaikin
  2019-11-15  2:29   ` [PATCH v2 " 王贇
  2 siblings, 1 reply; 45+ messages in thread
From: Jonathan Corbet @ 2019-11-13 15:09 UTC (permalink / raw)
  To: 王贇
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

On Wed, 13 Nov 2019 11:45:59 +0800
王贇 <yun.wang@linux.alibaba.com> wrote:

> Add the description for 'cg_numa_stat', also a new doc to explain
> the details on how to deal with the per-cgroup numa statistics.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  Documentation/admin-guide/cg-numa-stat.rst      | 161 ++++++++++++++++++++++++
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  3 files changed, 174 insertions(+)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

Thanks for adding documentation for your new feature!  When you add a new
RST file, though, you should also add it to index.rst so that it becomes a
part of the docs build.

A couple of nits below...

> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
> new file mode 100644
> index 000000000000..87b716c51e16
> --- /dev/null
> +++ b/Documentation/admin-guide/cg-numa-stat.rst
> @@ -0,0 +1,161 @@
> +===============================
> +Per-cgroup NUMA statistics
> +===============================
> +
> +Background
> +----------
> +
> +On NUMA platforms, remote memory accessing always has a performance penalty,
> +although we have NUMA balancing working hard to maximum the local accessing
> +proportion, there are still situations it can't helps.
> +
> +This could happen in modern production environment, using bunch of cgroups
> +to classify and control resources which introduced complex configuration on
> +memory policy, CPUs and NUMA node, NUMA balancing could facing the wrong
> +memory policy or exhausted local NUMA node, lead into the low local page
> +accessing proportion.
> +
> +We need to perceive such cases, figure out which workloads from which cgroup
> +has introduced the issues, then we got chance to do adjustment to avoid
> +performance damages.
> +
> +However, there are no hardware counter for per-task local/remote accessing
> +info, we don't know how many remote page accessing has been done for a
> +particular task.
> +
> +Statistics
> +----------
> +
> +Fortunately, we have NUMA Balancing which scan task's mapping and trigger PF
> +periodically, give us the opportunity to record per-task page accessing info.
> +
> +By "echo 1 > /proc/sys/kernel/cg_numa_stat" on runtime or add boot parameter
> +'cg_numa_stat', we will enable the accounting of per-cgroup numa statistics,
> +the 'cpu.numa_stat' entry of CPU cgroup will show statistics:
> +
> +  locality -- execution time sectioned by task NUMA locality (in ms)
> +  exectime -- execution time sectioned by NUMA node (in ms)
> +
> +We define 'task NUMA locality' as:
> +
> +  nr_local_page_access * 100 / (nr_local_page_access + nr_remote_page_access)
> +
> +this per-task percentage value will be updated on the ticks for current task,
> +and the access counter will be updated on task's NUMA balancing PF, so only
> +the pages which NUMA Balancing paid attention to will be accounted.
> +
> +On each tick, we acquire the locality of current task on that CPU, accumulating
> +the ticks into the counter of corresponding locality region, tasks from the
> +same group sharing the counters, becoming the group locality.
> +
> +Similarly, we acquire the NUMA node of current CPU where the current task is
> +executing on, accumulating the ticks into the counter of corresponding node,
> +becoming the per-cgroup node execution time.
> +
> +To be noticed, the accounting is in a hierarchy way, which means the numa
> +statistics representing not only the workload of this group, but also the
> +workloads of all it's descendants.
> +
> +For example the 'cpu.numa_stat' show:
> +  locality 39541 60962 36842 72519 118605 721778 946553
> +  exectime 1220127 1458684

You almost certainly want that rendered as a literal block, so say
"show::".  There are other places where you'll want to do that as well. 

> +The locality is sectioned into 7 regions, closely as:
> +  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
> +
> +And exectime is sectioned into 2 nodes, 0 and 1 in this case.
> +
> +Thus we know the workload of this group and it's descendants have totally
> +executed 1220127ms on node_0 and 1458684ms on node_1, tasks with locality
> +around 0~13% executed for 39541 ms, and tasks with locality around 87~100%
> +executed for 946553 ms, which imply most of the memory access are local.
> +
> +Monitoring
> +-----------------

A slightly long underline :)

I'll stop here; thanks again for adding documentation.

jon

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-13  3:45 ` [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
  2019-11-13 15:09   ` Jonathan Corbet
@ 2019-11-13 18:28   ` Iurii Zaikin
  2019-11-14  2:22     ` 王贇
  2019-11-15  2:29   ` [PATCH v2 " 王贇
  2 siblings, 1 reply; 45+ messages in thread
From: Iurii Zaikin @ 2019-11-13 18:28 UTC (permalink / raw)
  To: 王贇
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Michal Koutný,
	linux-fsdevel, Linux Kernel Mailing List,
	open list:DOCUMENTATION, Paul E. McKenney

Since the documentation talks about fairly advanced concepts, every little bit
of readability improvement helps. I tried to make suggestions that I feel make
it easier to read, hopefully my nitpicking is not too annoying.
On Tue, Nov 12, 2019 at 7:46 PM 王贇 <yun.wang@linux.alibaba.com> wrote:
> +On NUMA platforms, remote memory accessing always has a performance penalty,
> +although we have NUMA balancing working hard to maximum the local accessing
> +proportion, there are still situations it can't helps.
Nit: working hard to maximize the access locality...
can't helps -> can't help
> +
> +This could happen in modern production environment, using bunch of cgroups
> +to classify and control resources which introduced complex configuration on
> +memory policy, CPUs and NUMA node, NUMA balancing could facing the wrong
> +memory policy or exhausted local NUMA node, lead into the low local page
> +accessing proportion.
I find the below a bit easier to read.
This could happen in modern production environment. When a large
number of cgroups
are used to classify and control resources, this creates a complex
memory policy configuration
for CPUs and NUMA nodes. In such cases NUMA balancing could end up
with the wrong
memory policy or exhausted local NUMA node, which would lead to low
percentage of local page
accesses.

> +We need to perceive such cases, figure out which workloads from which cgroup
> +has introduced the issues, then we got chance to do adjustment to avoid
> +performance damages.
Nit: perceive -> detect, got-> get, damages-> degradation

> +However, there are no hardware counter for per-task local/remote accessing
> +info, we don't know how many remote page accessing has been done for a
> +particular task.
Nit: counters.
Nit: we don't know how many remote page accesses have occurred for a

> +
> +Statistics
> +----------
> +
> +Fortunately, we have NUMA Balancing which scan task's mapping and trigger PF
> +periodically, give us the opportunity to record per-task page accessing info.
Nit: scans, triggers, gives.

> +By "echo 1 > /proc/sys/kernel/cg_numa_stat" on runtime or add boot parameter
Nit: at runtime or adding boot parameter
> +To be noticed, the accounting is in a hierarchy way, which means the numa
> +statistics representing not only the workload of this group, but also the
> +workloads of all it's descendants.
Note that the accounting is hierarchical, which means the numa
statistics for a given group represents not only the workload of this
group, but also the
workloads of all it's descendants.
> +
> +For example the 'cpu.numa_stat' show:
> +  locality 39541 60962 36842 72519 118605 721778 946553
> +  exectime 1220127 1458684
> +
> +The locality is sectioned into 7 regions, closely as:
> +  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
Nit: closely -> approximately?

> +we can draw a line for region_bad_percent, when the line close to 0 things
nit: we can plot?
> +are good, when getting close to 100% something is wrong, we can pick a proper
> +watermark to trigger warning message.

> +You may want to drop the data if the region_all is too small, which imply
Nit: implies
> +there are not much available pages for NUMA Balancing, just ignore would be
Nit: not many... ingoring
> +fine since most likely the workload is insensitive to NUMA.
> +Monitoring root group help you control the overall situation, while you may
Nit: helps
> +also want to monitoring all the leaf groups which contain the workloads, this
Nit: monitor
> +help to catch the mouse.
Nit: helps
> +become too small, for NUMA node X we have:
Nit: becomes
> +try put your workload into a memory cgroup which providing per-node memory
Nit: try to put
> +These two percentage are usually matched on each node, workload should execute
Nit: percentages
> +Depends on which part of the memory accessed mostly by the workload, locality
Depending on which part of the memory is accessed.
"mostly by the workload" - not sure what you mean here, the majority
of accesses from the
workload fall into this part of memory or that accesses from processes
other than the workload
are rare?
> +could still be good with just a little piece of memory locally.
?
> +Thus to tell if things are find or not depends on the understanding of system
are fine
> +After locate which workloads introduced the bad locality, check:
locate -> indentifying
> +
> +1). Is the workloads bind into a particular NUMA node?
bind into -> bound to
> +2). Is there any NUMA node run out of resources?
Has any .. run out of resources
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 5e27d74e2b74..220df1f0beb8 100644
> +                       lot's of per-cgroup workloads.
lots

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-13 15:09   ` Jonathan Corbet
@ 2019-11-14  1:52     ` 王贇
  0 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-14  1:52 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Hi, Jonathan

On 2019/11/13 下午11:09, Jonathan Corbet wrote:
> On Wed, 13 Nov 2019 11:45:59 +0800
> 王贇 <yun.wang@linux.alibaba.com> wrote:
> 
>> Add the description for 'cg_numa_stat', also a new doc to explain
>> the details on how to deal with the per-cgroup numa statistics.
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Michal Koutný <mkoutny@suse.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>> ---
>>  Documentation/admin-guide/cg-numa-stat.rst      | 161 ++++++++++++++++++++++++
>>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>>  3 files changed, 174 insertions(+)
>>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst
> 
> Thanks for adding documentation for your new feature!  When you add a new
> RST file, though, you should also add it to index.rst so that it becomes a
> part of the docs build.

Thanks for pointing out :-) will fix this in next version.

> 
> A couple of nits below...
> 
>> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
>> new file mode 100644
>> index 000000000000..87b716c51e16
>> --- /dev/null
[snip]
>> +For example the 'cpu.numa_stat' show:
>> +  locality 39541 60962 36842 72519 118605 721778 946553
>> +  exectime 1220127 1458684
> 
> You almost certainly want that rendered as a literal block, so say
> "show::".  There are other places where you'll want to do that as well. 

I see, will fix such cases.

> 
>> +The locality is sectioned into 7 regions, closely as:
>> +  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
>> +
>> +And exectime is sectioned into 2 nodes, 0 and 1 in this case.
>> +
>> +Thus we know the workload of this group and it's descendants have totally
>> +executed 1220127ms on node_0 and 1458684ms on node_1, tasks with locality
>> +around 0~13% executed for 39541 ms, and tasks with locality around 87~100%
>> +executed for 946553 ms, which imply most of the memory access are local.
>> +
>> +Monitoring
>> +-----------------
> 
> A slightly long underline :)

Aha, will fix this too.

Regards,
Michael Wang

> 
> I'll stop here; thanks again for adding documentation.
> 
> jon
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-13 18:28   ` Iurii Zaikin
@ 2019-11-14  2:22     ` 王贇
  0 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-14  2:22 UTC (permalink / raw)
  To: Iurii Zaikin
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Michal Koutný,
	linux-fsdevel, Linux Kernel Mailing List,
	open list:DOCUMENTATION, Paul E. McKenney

Hi, Iurii

On 2019/11/14 上午2:28, Iurii Zaikin wrote:
> Since the documentation talks about fairly advanced concepts, every little bit
> of readability improvement helps. I tried to make suggestions that I feel make
> it easier to read, hopefully my nitpicking is not too annoying.

Any comments are welcomed :-)

> On Tue, Nov 12, 2019 at 7:46 PM 王贇 <yun.wang@linux.alibaba.com> wrote:
>> +On NUMA platforms, remote memory accessing always has a performance penalty,
>> +although we have NUMA balancing working hard to maximum the local accessing
>> +proportion, there are still situations it can't helps.
> Nit: working hard to maximize the access locality...
> can't helps -> can't help>> +
>> +This could happen in modern production environment, using bunch of cgroups
>> +to classify and control resources which introduced complex configuration on
>> +memory policy, CPUs and NUMA node, NUMA balancing could facing the wrong
>> +memory policy or exhausted local NUMA node, lead into the low local page
>> +accessing proportion.
> I find the below a bit easier to read.
> This could happen in modern production environment. When a large
> number of cgroups
> are used to classify and control resources, this creates a complex
> memory policy configuration
> for CPUs and NUMA nodes. In such cases NUMA balancing could end up
> with the wrong
> memory policy or exhausted local NUMA node, which would lead to low
> percentage of local page
> accesses.

Sounds better, just for the configuration part, since memory policy, CPUs
and NUMA nodes are configured by different approach, maybe we should still
separate them like:

This could happen in modern production environment. When a large
number of cgroups are used to classify and control resources, this
creates a complex configuration for memory policy, CPUs and NUMA nodes.
In such cases NUMA balancing could end up with the wrong memory policy
or exhausted local NUMA node, which would lead to low percentage of local
page accesses.

> 
>> +We need to perceive such cases, figure out which workloads from which cgroup
>> +has introduced the issues, then we got chance to do adjustment to avoid
>> +performance damages.
> Nit: perceive -> detect, got-> get, damages-> degradation
> 
>> +However, there are no hardware counter for per-task local/remote accessing
>> +info, we don't know how many remote page accessing has been done for a
>> +particular task.
> Nit: counters.
> Nit: we don't know how many remote page accesses have occurred for a
> 
>> +
>> +Statistics
>> +----------
>> +
>> +Fortunately, we have NUMA Balancing which scan task's mapping and trigger PF
>> +periodically, give us the opportunity to record per-task page accessing info.
> Nit: scans, triggers, gives.
> 
>> +By "echo 1 > /proc/sys/kernel/cg_numa_stat" on runtime or add boot parameter
> Nit: at runtime or adding boot parameter
>> +To be noticed, the accounting is in a hierarchy way, which means the numa
>> +statistics representing not only the workload of this group, but also the
>> +workloads of all it's descendants.
> Note that the accounting is hierarchical, which means the numa
> statistics for a given group represents not only the workload of this
> group, but also the
> workloads of all it's descendants.
>> +
>> +For example the 'cpu.numa_stat' show:
>> +  locality 39541 60962 36842 72519 118605 721778 946553
>> +  exectime 1220127 1458684
>> +
>> +The locality is sectioned into 7 regions, closely as:
>> +  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
> Nit: closely -> approximately?
> 
>> +we can draw a line for region_bad_percent, when the line close to 0 things
> nit: we can plot?
>> +are good, when getting close to 100% something is wrong, we can pick a proper
>> +watermark to trigger warning message.
> 
>> +You may want to drop the data if the region_all is too small, which imply
> Nit: implies
>> +there are not much available pages for NUMA Balancing, just ignore would be
> Nit: not many... ingoring
>> +fine since most likely the workload is insensitive to NUMA.
>> +Monitoring root group help you control the overall situation, while you may
> Nit: helps
>> +also want to monitoring all the leaf groups which contain the workloads, this
> Nit: monitor
>> +help to catch the mouse.
> Nit: helps
>> +become too small, for NUMA node X we have:
> Nit: becomes
>> +try put your workload into a memory cgroup which providing per-node memory
> Nit: try to put
>> +These two percentage are usually matched on each node, workload should execute
> Nit: percentages
>> +Depends on which part of the memory accessed mostly by the workload, locality> Depending on which part of the memory is accessed.
> "mostly by the workload" - not sure what you mean here, the majority
> of accesses from the
> workload fall into this part of memory or that accesses from processes
> other than the workload
> are rare?

The prev one actually, sometime the workload only access part of it's
memory, could be a small part but as long as this part is local, things
could be fine.

>> +could still be good with just a little piece of memory locally.
> ?

whatabout:

workload may only access a small part of it's memory, in such cases, although
the majority of memory are remotely, locality could still be good.

>> +Thus to tell if things are find or not depends on the understanding of system
> are fine
>> +After locate which workloads introduced the bad locality, check:
> locate -> indentifying
>> +
>> +1). Is the workloads bind into a particular NUMA node?
> bind into -> bound to
>> +2). Is there any NUMA node run out of resources?
> Has any .. run out of resources
>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>> index 5e27d74e2b74..220df1f0beb8 100644
>> +                       lot's of per-cgroup workloads.
> lots

Thanks for point out all these issues, very helpful :-)

Should apply them in next version.

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-13  3:45 ` [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
  2019-11-13 15:09   ` Jonathan Corbet
  2019-11-13 18:28   ` Iurii Zaikin
@ 2019-11-15  2:29   ` " 王贇
  2 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-15  2:29 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Jonathan Corbet, Iurii Zaikin

Add the description for 'cg_numa_stat', also a new doc to explain
the details on how to deal with the per-cgroup numa statistics.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Iurii Zaikin <yzaikin@google.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
Since v1:
  * thanks to Iurii for the better sentences
  * thanks to Jonathan for the better format

 Documentation/admin-guide/cg-numa-stat.rst      | 163 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 4 files changed, 177 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
new file mode 100644
index 000000000000..1aed3b1d23c4
--- /dev/null
+++ b/Documentation/admin-guide/cg-numa-stat.rst
@@ -0,0 +1,163 @@
+===============================
+Per-cgroup NUMA statistics
+===============================
+
+Background
+----------
+
+On NUMA platforms, remote memory accessing always has a performance penalty,
+although we have NUMA balancing working hard to maximize the access locality,
+there are still situations it can't help.
+
+This could happen in modern production environment. When a large number of
+cgroups are used to classify and control resources, this creates a complex
+configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
+balancing could end up with the wrong memory policy or exhausted local NUMA
+node, which would lead to low percentage of local page accesses.
+
+We need to detect such cases, figure out which workloads from which cgroup
+has introduced the issues, then we get chance to do adjustment to avoid
+performance degradation.
+
+However, there are no hardware counters for per-task local/remote accessing
+info, we don't know how many remote page accesses have occurred for a
+particular task.
+
+Statistics
+----------
+
+Fortunately, we have NUMA Balancing which scans task's mapping and triggers PF
+periodically, gives us the opportunity to record per-task page accessing info.
+
+By "echo 1 > /proc/sys/kernel/cg_numa_stat" at runtime or adding boot parameter
+'cg_numa_stat', we will enable the accounting of per-cgroup numa statistics,
+the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
+
+  locality -- execution time sectioned by task NUMA locality (in ms)
+  exectime -- execution time sectioned by NUMA node (in ms)
+
+We define 'task NUMA locality' as::
+
+  nr_local_page_access * 100 / (nr_local_page_access + nr_remote_page_access)
+
+this per-task percentage value will be updated on the ticks for current task,
+and the access counter will be updated on task's NUMA balancing PF, so only
+the pages which NUMA Balancing paid attention to will be accounted.
+
+On each tick, we acquire the locality of current task on that CPU, accumulating
+the ticks into the counter of corresponding locality region, tasks from the
+same group sharing the counters, becoming the group locality.
+
+Similarly, we acquire the NUMA node of current CPU where the current task is
+executing on, accumulating the ticks into the counter of corresponding node,
+becoming the per-cgroup node execution time.
+
+Note that the accounting is hierarchical, which means the numa statistics for
+a given group represents not only the workload of this group, but also the
+workloads of all it's descendants.
+
+For example the 'cpu.numa_stat' show::
+
+  locality 39541 60962 36842 72519 118605 721778 946553
+  exectime 1220127 1458684
+
+The locality is sectioned into 7 regions, approximately as::
+
+  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
+
+And exectime is sectioned into 2 nodes, 0 and 1 in this case.
+
+Thus we know the workload of this group and it's descendants have totally
+executed 1220127ms on node_0 and 1458684ms on node_1, tasks with locality
+around 0~13% executed for 39541 ms, and tasks with locality around 87~100%
+executed for 946553 ms, which imply most of the memory access are local.
+
+Monitoring
+----------
+
+By monitoring the increments of these statistics, we can easily know whether
+NUMA balancing is working well for a particular workload.
+
+For example we take a 5 secs sample period, and consider locality under 27%
+is bad, then on each sampling we have::
+
+  region_bad = region_1 + region_2
+  region_all = region_1 + region_2 + ... + region_7
+
+and we have the increments as::
+
+  region_bad_diff = region_bad - last_region_bad
+  region_all_diff = region_all - last_region_all
+
+which finally become::
+
+  region_bad_percent = region_bad_diff * 100 / region_all_diff
+
+we can plot a line for region_bad_percent, when the line close to 0 things
+are good, when getting close to 100% something is wrong, we can pick a proper
+watermark to trigger warning message.
+
+You may want to drop the data if the region_all is too small, which implies
+there are not many available pages for NUMA Balancing, ignoring would be fine
+since most likely the workload is insensitive to NUMA.
+
+Monitoring root group helps you control the overall situation, while you may
+also want to monitor all the leaf groups which contain the workloads, this
+helps to catch the mouse.
+
+The exectime could be useful when NUMA Balancing is disabled, or when locality
+becomes too small, for NUMA node X we have::
+
+  exectime_X_diff = exectime_X - last_exectime_X
+  exectime_all_diff = exectime_all - last_exectime_all
+
+try to put your workload into a memory cgroup which providing per-node memory
+consumption by 'memory.numa_stat' entry, then we could get::
+
+  memory_percent_X = memory_X * 100 / memory_all
+  exectime_percent_X = exectime_X_diff * 100 / exectime_all_diff
+
+These two percentages are usually matched on each node, workload should execute
+mostly on the node contain most of it's memory, but it's not guaranteed.
+
+The workload may only access a small part of it's memory, in such cases although
+the majority of memory are remotely, locality could still be good.
+
+Thus to tell if things are fine or not depends on the understanding of system
+resource deployment, however, if you find node X got 100% memory percent but 0%
+exectime percent, definitely something is wrong.
+
+Troubleshooting
+---------------
+
+After identifying which workload introduced the bad locality, check:
+
+1). Is the workload bound to a particular NUMA node?
+2). Has any NUMA node run out of resources?
+
+There are several ways to bind task's memory with a NUMA node, the strict way
+like the MPOL_BIND memory policy or 'cpuset.mems' will limiting the memory
+node where to allocate pages, in this situation, admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and make sure there are
+available CPU resource there.
+
+There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
+sched_setaffinity() syscall, in this situation, NUMA Balancing help to migrate
+pages into that node, admin should make sure there are available memory there.
+
+Admin could try rebind or unbind the NUMA node to erase the damage, make a
+change then observe the statistics see if things get better until the situation
+is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may found no necessary to scan pages, and
+locality could always be 0 or small number, don't pay attention to them
+since they most likely insensitive to NUMA.
+
+There are no accounting until the option turned on, so enable it in advance
+if you want to have the whole history.
+
+We have per-task migfailed counter to tell how many page migration has been
+failed for a particular task, you will find it in /proc/PID/sched entry.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 4405b7485312..c75a3fdfcd94 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -112,6 +112,7 @@ configure specific aspects of kernel behavior to your liking.
    video-output
    wimax/index
    xfs
+   cg-numa-stat

 .. only::  subproject and html

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5e27d74e2b74..475b8351be6d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3191,6 +3191,10 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable

+	cg_numa_atat	[KNL] Enable advanced per-cgroup numa statistics.
+			Useful to debug NUMA efficiency problems when there are
+			lots of per-cgroup workloads.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 614179dc79a9..719593e8be20 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -572,6 +572,15 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.

+cg_numa_stat:
+=============
+
+Enables/disables advanced per-cgroup NUMA statistic.
+
+0: disabled (default).
+1: enabled.
+
+Check Documentation/admin-guide/cg-numa-stat.rst for details.

 osrelease, ostype & version:
 ============================
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/3] sched/numa: introduce advanced numa statistic
  2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
                   ` (2 preceding siblings ...)
  2019-11-13  3:45 ` [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
@ 2019-11-20  9:45 ` 王贇
  2019-11-25  1:35 ` 王贇
  2019-11-27  1:48 ` [PATCH v2 " 王贇
  5 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-20  9:45 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Hi, Mel

I hope this version has addressed the issues you mentioned :-)

Does it looks better now?

Regards,
Michael Wang

On 2019/11/13 上午11:43, 王贇 wrote:
> Modern production environment could use hundreds of cgroup to control
> the resources for different workloads, along with the complicated
> resource binding.
> 
> On NUMA platforms where we have multiple nodes, things become even more
> complicated, we hope there are more local memory access to improve the
> performance, and NUMA Balancing keep working hard to achieve that,
> however, wrong memory policy or node binding could easily waste the
> effort, result a lot of remote page accessing.
> 
> We need to perceive such problems, then we got chance to fix it before
> there are too much damages, however, there are no good approach yet to
> help catch the mouse who introduced the remote access.
> 
> This patch set is trying to fill in the missing pieces, by introduce
> the per-cgroup NUMA locality/exectime statistics, and expose the per-task
> page migration failure counter, with these statistics, we could achieve
> the daily monitoring on NUMA efficiency, to give warning when things going
> too wrong.
> 
> Please check the third patch for more details.
> 
> Thanks to Peter, Mel and Michal for the good advices :-)
> 
> Michael Wang (3):
>   sched/numa: advanced per-cgroup numa statistic
>   sched/numa: expose per-task pages-migration-failure counter
>   sched/numa: documentation for per-cgroup numa stat
> 
>  Documentation/admin-guide/cg-numa-stat.rst      | 161 ++++++++++++++++++++++++
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  include/linux/sched.h                           |  18 ++-
>  include/linux/sched/sysctl.h                    |   6 +
>  init/Kconfig                                    |   9 ++
>  kernel/sched/core.c                             |  91 ++++++++++++++
>  kernel/sched/debug.c                            |   1 +
>  kernel/sched/fair.c                             |  33 +++++
>  kernel/sched/sched.h                            |  17 +++
>  kernel/sysctl.c                                 |  11 ++
>  11 files changed, 359 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 0/3] sched/numa: introduce advanced numa statistic
  2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
                   ` (3 preceding siblings ...)
  2019-11-20  9:45 ` [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
@ 2019-11-25  1:35 ` 王贇
  2019-11-27  1:48 ` [PATCH v2 " 王贇
  5 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-25  1:35 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Hi, Peter

How do you think about this version, does it looks fine to you?

Regards,
Michael Wang

On 2019/11/13 上午11:43, 王贇 wrote:
> Modern production environment could use hundreds of cgroup to control
> the resources for different workloads, along with the complicated
> resource binding.
> 
> On NUMA platforms where we have multiple nodes, things become even more
> complicated, we hope there are more local memory access to improve the
> performance, and NUMA Balancing keep working hard to achieve that,
> however, wrong memory policy or node binding could easily waste the
> effort, result a lot of remote page accessing.
> 
> We need to perceive such problems, then we got chance to fix it before
> there are too much damages, however, there are no good approach yet to
> help catch the mouse who introduced the remote access.
> 
> This patch set is trying to fill in the missing pieces, by introduce
> the per-cgroup NUMA locality/exectime statistics, and expose the per-task
> page migration failure counter, with these statistics, we could achieve
> the daily monitoring on NUMA efficiency, to give warning when things going
> too wrong.
> 
> Please check the third patch for more details.
> 
> Thanks to Peter, Mel and Michal for the good advices :-)
> 
> Michael Wang (3):
>   sched/numa: advanced per-cgroup numa statistic
>   sched/numa: expose per-task pages-migration-failure counter
>   sched/numa: documentation for per-cgroup numa stat
> 
>  Documentation/admin-guide/cg-numa-stat.rst      | 161 ++++++++++++++++++++++++
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  include/linux/sched.h                           |  18 ++-
>  include/linux/sched/sysctl.h                    |   6 +
>  init/Kconfig                                    |   9 ++
>  kernel/sched/core.c                             |  91 ++++++++++++++
>  kernel/sched/debug.c                            |   1 +
>  kernel/sched/fair.c                             |  33 +++++
>  kernel/sched/sched.h                            |  17 +++
>  kernel/sysctl.c                                 |  11 ++
>  11 files changed, 359 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 0/3] sched/numa: introduce advanced numa statistic
  2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
                   ` (4 preceding siblings ...)
  2019-11-25  1:35 ` 王贇
@ 2019-11-27  1:48 ` " 王贇
  2019-11-27  1:49   ` [PATCH v2 1/3] sched/numa: advanced per-cgroup " 王贇
                     ` (3 more replies)
  5 siblings, 4 replies; 45+ messages in thread
From: 王贇 @ 2019-11-27  1:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Since v1:
  * Improved documentation

Modern production environment could use hundreds of cgroup to control
the resources for different workloads, along with the complicated
resource binding.

On NUMA platforms where we have multiple nodes, things become even more
complicated, we hope there are more local memory access to improve the
performance, and NUMA Balancing keep working hard to achieve that,
however, wrong memory policy or node binding could easily waste the
effort, result a lot of remote page accessing.

We need to perceive such problems, then we got chance to fix it before
there are too much damages, however, there are no good approach yet to
help catch the mouse who introduced the remote access.

This patch set is trying to fill in the missing pieces, by introduce
the per-cgroup NUMA locality/exectime statistics, and expose the per-task
page migration failure counter, with these statistics, we could achieve
the daily monitoring on NUMA efficiency, to give warning when things going
too wrong.

Please check the third patch for more details.

Thanks to Peter, Mel and Michal for the good advice.

Michael Wang (3):
  sched/numa: advanced per-cgroup numa statistic
  sched/numa: expose per-task pages-migration-failure counter
  sched/numa: documentation for per-cgroup numa stat

 Documentation/admin-guide/cg-numa-stat.rst      | 163 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 include/linux/sched.h                           |  18 ++-
 include/linux/sched/sysctl.h                    |   6 +
 init/Kconfig                                    |   9 ++
 kernel/sched/core.c                             |  91 +++++++++++++
 kernel/sched/debug.c                            |   1 +
 kernel/sched/fair.c                             |  33 +++++
 kernel/sched/sched.h                            |  17 +++
 kernel/sysctl.c                                 |  11 ++
 12 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

-- 
2.14.4.44.g2045bb6

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-27  1:48 ` [PATCH v2 " 王贇
@ 2019-11-27  1:49   ` " 王贇
  2019-11-27 10:19     ` Mel Gorman
  2019-11-27  1:50   ` [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure 王贇
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-11-27  1:49 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Currently there are no good approach to monitoring the per-cgroup
numa efficiency, this could be a trouble especially when groups
are sharing CPUs, it's impossible to tell which one caused the
remote-memory access by reading hardware counter since multiple
workloads could sharing the same CPU, which make it painful when
one want to find out the root cause and fix the issue.

In order to address this, we introduced new per-cgroup statistic
for numa:
  * the numa locality to imply the numa balancing efficiency
  * the numa execution time on each node

The task locality is the local page accessing ratio traced on numa
balancing PF, and the group locality is the topology of task execution
time, sectioned by the locality into 7 regions.

For example the new entry 'cpu.numa_stat' show:
  locality 39541 60962 36842 72519 118605 721778 946553
  exectime 1220127 1458684

Here we know the workloads in hierarchy executed 1220127ms on node_0
and 1458684ms on node_1 in total, tasks with locality around 0~13%
executed for 39541 ms, and tasks with locality around 86~100% executed
for 946553 ms, which imply most of the memory access are local access.

By monitoring the new statistic, we will be able to know the numa
efficiency of each per-cgroup workloads on machine, whatever they
sharing the CPUs or not, we will be able to find out which one
introduced the remote access mostly.

Besides, per-node memory topology from 'memory.numa_stat' become
more useful when we have the per-node execution time, workloads
always executing on node_0 while it's memory is all on node_1 is
usually a bad case.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/sched.h        | 18 ++++++++-
 include/linux/sched/sysctl.h |  6 +++
 init/Kconfig                 |  9 +++++
 kernel/sched/core.c          | 91 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 33 ++++++++++++++++
 kernel/sched/sched.h         | 17 +++++++++
 kernel/sysctl.c              | 11 ++++++
 7 files changed, 184 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f6607cd40ac..505b041594ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1118,9 +1118,25 @@ struct task_struct {
 	 * numa_faults_locality tracks if faults recorded during the last
 	 * scan window were remote/local or failed to migrate. The task scan
 	 * period is adapted based on the locality of the faults with different
-	 * weights depending on whether they were shared or private faults
+	 * weights depending on whether they were shared or private faults.
+	 *
+	 * Counter id stand for:
+	 * 0 -- remote faults
+	 * 1 -- local faults
+	 * 2 -- page migration failure
+	 *
+	 * Extra counters when CONFIG_CGROUP_NUMA_STAT enabled:
+	 * 3 -- remote page accessing
+	 * 4 -- local page accessing
+	 *
+	 * The 'remote/local faults' records the cpu-page relationship before
+	 * page migration, while the 'remote/local page accessing' is after.
 	 */
+#ifndef CONFIG_CGROUP_NUMA_STAT
 	unsigned long			numa_faults_locality[3];
+#else
+	unsigned long			numa_faults_locality[5];
+#endif

 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 89f55e914673..2d6a515df544 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -102,4 +102,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
 				 loff_t *ppos);
 #endif

+#ifdef CONFIG_CGROUP_NUMA_STAT
+extern int sysctl_cg_numa_stat(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 4d8d145c41d2..b31d2b560493 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.

+config CGROUP_NUMA_STAT
+	bool "Advanced per-cgroup NUMA statistics"
+	default n
+	depends on CGROUP_SCHED && NUMA_BALANCING
+	help
+	  This option adds support for per-cgroup NUMA locality/execution
+	  statistics, for monitoring NUMA efficiency of per-cgroup workloads
+	  on NUMA platforms with NUMA Balancing enabled.
+
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aaa1740e6497..eabcab25be50 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7657,6 +7657,84 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_STAT
+DEFINE_STATIC_KEY_FALSE(sched_cg_numa_stat);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_cg_numa_stat(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_cg_numa_stat);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0 || !write)
+		return err;
+
+	if (state)
+		static_branch_enable(&sched_cg_numa_stat);
+	else
+		static_branch_disable(&sched_cg_numa_stat);
+
+	return err;
+}
+#endif
+
+static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
+{
+	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
+}
+
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int nr;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	if (!static_branch_likely(&sched_cg_numa_stat))
+		return 0;
+
+	seq_puts(sf, "locality");
+	for (nr = 0; nr < NR_NL_INTERVAL; nr++) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_possible_cpu(cpu)
+			sum += tg_cfs_rq(tg, cpu)->nstat.locality[nr];
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
+	seq_puts(sf, "exectime");
+	for_each_online_node(nr) {
+		int cpu;
+		u64 sum = 0;
+
+		for_each_cpu(cpu, cpumask_of_node(nr))
+			sum += tg_cfs_rq(tg, cpu)->nstat.jiffies;
+
+		seq_printf(sf, " %u", jiffies_to_msecs(sum));
+	}
+	seq_putc(sf, '\n');
+
+	return 0;
+}
+
+static __init int cg_numa_stat_setup(char *opt)
+{
+	static_branch_enable(&sched_cg_numa_stat);
+
+	return 0;
+}
+__setup("cg_numa_stat", cg_numa_stat_setup);
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7706,6 +7784,12 @@ static struct cftype cpu_legacy_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7887,6 +7971,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	{
+		.name = "numa_stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81eba554db8d..d5fa038d07d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2465,6 +2465,18 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	if (!static_branch_unlikely(&sched_cg_numa_stat))
+		return;
+
+	/*
+	 * We want to record the real local/remote page access statistic
+	 * here, so use 'mem_node' which is the real residential node of
+	 * page after migrate_misplaced_page().
+	 */
+	p->numa_faults_locality[3 + !!(mem_node == numa_node_id())] += pages;
+#endif
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -4285,6 +4297,23 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	cfs_rq->curr = NULL;
 }

+#ifdef CONFIG_CGROUP_NUMA_STAT
+static void update_numa_statistics(struct cfs_rq *cfs_rq)
+{
+	int idx;
+	unsigned long remote = current->numa_faults_locality[3];
+	unsigned long local = current->numa_faults_locality[4];
+
+	cfs_rq->nstat.jiffies++;
+
+	if (!remote && !local)
+		return;
+
+	idx = (NR_NL_INTERVAL - 1) * local / (remote + local);
+	cfs_rq->nstat.locality[idx]++;
+}
+#endif
+
 static void
 entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 {
@@ -4298,6 +4327,10 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	if (static_branch_unlikely(&sched_cg_numa_stat))
+		update_numa_statistics(cfs_rq);
+#endif

 #ifdef CONFIG_SCHED_HRTICK
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 05c282775f21..87080c7164ac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -486,6 +486,16 @@ struct cfs_bandwidth { };

 #endif	/* CONFIG_CGROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_STAT
+/* NUMA Locality Interval, 7 buckets for cache align */
+#define NR_NL_INTERVAL	7
+
+struct numa_statistics {
+	u64 jiffies;
+	u64 locality[NR_NL_INTERVAL];
+};
+#endif
+
 /* CFS-related fields in a runqueue */
 struct cfs_rq {
 	struct load_weight	load;
@@ -575,6 +585,9 @@ struct cfs_rq {
 	struct list_head	throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	struct numa_statistics	nstat ____cacheline_aligned;
+#endif
 };

 static inline int rt_bandwidth_enabled(void)
@@ -1601,6 +1614,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;

+#ifdef CONFIG_CGROUP_NUMA_STAT
+extern struct static_key_false sched_cg_numa_stat;
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50373984a5e2..2df9a5e7f875 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CGROUP_NUMA_STAT
+	{
+		.procname	= "cg_numa_stat",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_cg_numa_stat,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif /* CONFIG_CGROUP_NUMA_STAT */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure
  2019-11-27  1:48 ` [PATCH v2 " 王贇
  2019-11-27  1:49   ` [PATCH v2 1/3] sched/numa: advanced per-cgroup " 王贇
@ 2019-11-27  1:50   ` 王贇
  2019-11-27 10:00     ` Mel Gorman
  2019-12-02  2:22     ` 王贇
  2019-11-27  1:50   ` [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
  2019-12-03  5:59   ` [PATCH v3 0/2] sched/numa: introduce numa locality 王贇
  3 siblings, 2 replies; 45+ messages in thread
From: 王贇 @ 2019-11-27  1:50 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

NUMA balancing will try to migrate pages between nodes, which
could caused by memory policy or numa group aggregation, while
the page migration could failed too for eg when the target node
run out of memory.

Since this is critical to the performance, admin should know
how serious the problem is, and take actions before it causing
too much performance damage, thus this patch expose the counter
as 'migfailed' in '/proc/PID/sched'.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 kernel/sched/debug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index f7e4579e746c..73c4809c8f37 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -848,6 +848,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	P(total_numa_faults);
 	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
 			task_node(p), task_numa_group_id(p));
+	SEQ_printf(m, "migfailed=%lu\n", p->numa_faults_locality[2]);
 	show_numa_stats(p, m);
 	mpol_put(pol);
 #endif
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-27  1:48 ` [PATCH v2 " 王贇
  2019-11-27  1:49   ` [PATCH v2 1/3] sched/numa: advanced per-cgroup " 王贇
  2019-11-27  1:50   ` [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure 王贇
@ 2019-11-27  1:50   ` 王贇
  2019-11-27  4:58     ` Randy Dunlap
  2019-12-03  5:59   ` [PATCH v3 0/2] sched/numa: introduce numa locality 王贇
  3 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-11-27  1:50 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Since v1:
  * thanks to Iurii for the better sentence
  * thanks to Jonathan for the better format

Add the description for 'cg_numa_stat', also a new doc to explain
the details on how to deal with the per-cgroup numa statistics.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Iurii Zaikin <yzaikin@google.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 Documentation/admin-guide/cg-numa-stat.rst      | 163 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 4 files changed, 177 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
new file mode 100644
index 000000000000..6f505f46fe00
--- /dev/null
+++ b/Documentation/admin-guide/cg-numa-stat.rst
@@ -0,0 +1,163 @@
+===============================
+Per-cgroup NUMA statistics
+===============================
+
+Background
+----------
+
+On NUMA platforms, remote memory accessing always has a performance penalty,
+although we have NUMA balancing working hard to maximize the access locality,
+there are still situations it can't help.
+
+This could happen in modern production environment. When a large number of
+cgroups are used to classify and control resources, this creates a complex
+configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
+balancing could end up with the wrong memory policy or exhausted local NUMA
+node, which would lead to low percentage of local page accesses.
+
+We need to detect such cases, figure out which workloads from which cgroup
+has introduced the issues, then we get chance to do adjustment to avoid
+performance degradation.
+
+However, there are no hardware counters for per-task local/remote accessing
+info, we don't know how many remote page accesses have occurred for a
+particular task.
+
+Statistics
+----------
+
+Fortunately, we have NUMA Balancing which scans task's mapping and triggers PF
+periodically, gives us the opportunity to record per-task page accessing info.
+
+By "echo 1 > /proc/sys/kernel/cg_numa_stat" at runtime or adding boot parameter
+'cg_numa_stat', we will enable the accounting of per-cgroup numa statistics,
+the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
+
+  locality -- execution time sectioned by task NUMA locality (in ms)
+  exectime -- execution time sectioned by NUMA node (in ms)
+
+We define 'task NUMA locality' as::
+
+  nr_local_page_access * 100 / (nr_local_page_access + nr_remote_page_access)
+
+this per-task percentage value will be updated on the ticks for current task,
+and the access counter will be updated on task's NUMA balancing PF, so only
+the pages which NUMA Balancing paid attention to will be accounted.
+
+On each tick, we acquire the locality of current task on that CPU, accumulating
+the ticks into the counter of corresponding locality region, tasks from the
+same group sharing the counters, becoming the group locality.
+
+Similarly, we acquire the NUMA node of current CPU where the current task is
+executing on, accumulating the ticks into the counter of corresponding node,
+becoming the per-cgroup node execution time.
+
+Note that the accounting is hierarchical, which means the numa statistics for
+a given group represents not only the workload of this group, but also the
+workloads of all it's descendants.
+
+For example the 'cpu.numa_stat' show::
+
+  locality 39541 60962 36842 72519 118605 721778 946553
+  exectime 1220127 1458684
+
+The locality is sectioned into 7 regions, approximately as::
+
+  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
+
+And exectime is sectioned into 2 nodes, 0 and 1 in this case.
+
+Thus we know the workload of this group and it's descendants have totally
+executed 1220127ms on node_0 and 1458684ms on node_1, tasks with locality
+around 0~13% executed for 39541 ms, and tasks with locality around 87~100%
+executed for 946553 ms, which imply most of the memory access are local.
+
+Monitoring
+----------
+
+By monitoring the increments of these statistics, we can easily know whether
+NUMA balancing is working well for a particular workload.
+
+For example we take a 5 secs sample period, and consider locality under 27%
+is bad, then on each sampling we have::
+
+  region_bad = region_1 + region_2
+  region_all = region_1 + region_2 + ... + region_7
+
+and we have the increments as::
+
+  region_bad_diff = region_bad - last_region_bad
+  region_all_diff = region_all - last_region_all
+
+which finally become::
+
+  region_bad_percent = region_bad_diff * 100 / region_all_diff
+
+we can plot a line for region_bad_percent, when the line close to 0 things
+are good, when getting close to 100% something is wrong, we can pick a proper
+watermark to trigger warning message.
+
+You may want to drop the data if the region_all is too small, which implies
+there are not many available pages for NUMA Balancing, ignoring would be fine
+since most likely the workload is insensitive to NUMA.
+
+Monitoring root group helps you control the overall situation, while you may
+also want to monitor all the leaf groups which contain the workloads, this
+helps to catch the mouse.
+
+The exectime could be useful when NUMA Balancing is disabled, or when locality
+becomes too small, for NUMA node X we have::
+
+  exectime_X_diff = exectime_X - last_exectime_X
+  exectime_all_diff = exectime_all - last_exectime_all
+
+try to put your workload into a memory cgroup which providing per-node memory
+consumption by 'memory.numa_stat' entry, then we could get::
+
+  memory_percent_X = memory_X * 100 / memory_all
+  exectime_percent_X = exectime_X_diff * 100 / exectime_all_diff
+
+These two percentages are usually matched on each node, workload should execute
+mostly on the node contain most of it's memory, but it's not guaranteed.
+
+The workload may only access a small part of it's memory, in such cases although
+the majority of memory are remotely, locality could still be good.
+
+Thus to tell if things are fine or not depends on the understanding of system
+resource deployment, however, if you find node X got 100% memory percent but 0%
+exectime percent, definitely something is wrong.
+
+Troubleshooting
+---------------
+
+After identifying which workload introduced the bad locality, check:
+
+1). Is the workload bound to a particular NUMA node?
+2). Has any NUMA node run out of resources?
+
+There are several ways to bind task's memory with a NUMA node, the strict way
+like the MPOL_BIND memory policy or 'cpuset.mems' will limiting the memory
+node where to allocate pages, in this situation, admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and make sure there are
+available CPU resource there.
+
+There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
+sched_setaffinity() syscall, in this situation, NUMA Balancing help to migrate
+pages into that node, admin should make sure there are available memory there.
+
+Admin could try rebind or unbind the NUMA node to erase the damage, make a
+change then observe the statistics see if things get better until the situation
+is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may found no necessary to scan pages, and
+locality could always be 0 or small number, don't pay attention to them
+since they most likely insensitive to NUMA.
+
+There are no accounting until the option turned on, so enable it in advance
+if you want to have the whole history.
+
+We have per-task migfailed counter to tell how many page migration has been
+failed for a particular task, you will find it in /proc/PID/sched entry.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 4405b7485312..c75a3fdfcd94 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -112,6 +112,7 @@ configure specific aspects of kernel behavior to your liking.
    video-output
    wimax/index
    xfs
+   cg-numa-stat

 .. only::  subproject and html

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0945611b3877..674c1a4bae14 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3227,6 +3227,10 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable

+	cg_numa_atat	[KNL] Enable advanced per-cgroup numa statistics.
+			Useful to debug NUMA efficiency problems when there are
+			lots of per-cgroup workloads.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 7e203b3ed331..918a26d2bd77 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -572,6 +572,15 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.

+cg_numa_stat:
+=============
+
+Enables/disables advanced per-cgroup NUMA statistic.
+
+0: disabled (default).
+1: enabled.
+
+Check Documentation/admin-guide/cg-numa-stat.rst for details.

 osrelease, ostype & version:
 ============================
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-27  1:50   ` [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
@ 2019-11-27  4:58     ` Randy Dunlap
  2019-11-27  5:54       ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2019-11-27  4:58 UTC (permalink / raw)
  To: 王贇,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

On 11/26/19 5:50 PM, 王贇 wrote:
> Since v1:
>   * thanks to Iurii for the better sentence
>   * thanks to Jonathan for the better format
> 
> Add the description for 'cg_numa_stat', also a new doc to explain
> the details on how to deal with the per-cgroup numa statistics.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Iurii Zaikin <yzaikin@google.com>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>

Hi,
I have a few comments/corrections. Please see below.

> ---
>  Documentation/admin-guide/cg-numa-stat.rst      | 163 ++++++++++++++++++++++++
>  Documentation/admin-guide/index.rst             |   1 +
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  4 files changed, 177 insertions(+)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst
> 
> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
> new file mode 100644
> index 000000000000..6f505f46fe00
> --- /dev/null
> +++ b/Documentation/admin-guide/cg-numa-stat.rst
> @@ -0,0 +1,163 @@
> +===============================
> +Per-cgroup NUMA statistics
> +===============================
> +
> +Background
> +----------
> +
> +On NUMA platforms, remote memory accessing always has a performance penalty,

                                                                       penalty.

> +although we have NUMA balancing working hard to maximize the access locality,

   Although

> +there are still situations it can't help.
> +
> +This could happen in modern production environment. When a large number of
> +cgroups are used to classify and control resources, this creates a complex
> +configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
> +balancing could end up with the wrong memory policy or exhausted local NUMA
> +node, which would lead to low percentage of local page accesses.
> +
> +We need to detect such cases, figure out which workloads from which cgroup
> +has introduced the issues, then we get chance to do adjustment to avoid

   have

> +performance degradation.
> +
> +However, there are no hardware counters for per-task local/remote accessing
> +info, we don't know how many remote page accesses have occurred for a
> +particular task.
> +
> +Statistics
> +----------
> +
> +Fortunately, we have NUMA Balancing which scans task's mapping and triggers PF
> +periodically, gives us the opportunity to record per-task page accessing info.

                 giving

> +
> +By "echo 1 > /proc/sys/kernel/cg_numa_stat" at runtime or adding boot parameter
> +'cg_numa_stat', we will enable the accounting of per-cgroup numa statistics,

                                                               NUMA

> +the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
> +
> +  locality -- execution time sectioned by task NUMA locality (in ms)
> +  exectime -- execution time sectioned by NUMA node (in ms)
> +
> +We define 'task NUMA locality' as::
> +
> +  nr_local_page_access * 100 / (nr_local_page_access + nr_remote_page_access)
> +
> +this per-task percentage value will be updated on the ticks for current task,

   This

> +and the access counter will be updated on task's NUMA balancing PF, so only
> +the pages which NUMA Balancing paid attention to will be accounted.
> +
> +On each tick, we acquire the locality of current task on that CPU, accumulating
> +the ticks into the counter of corresponding locality region, tasks from the
> +same group sharing the counters, becoming the group locality.
> +
> +Similarly, we acquire the NUMA node of current CPU where the current task is
> +executing on, accumulating the ticks into the counter of corresponding node,
> +becoming the per-cgroup node execution time.
> +
> +Note that the accounting is hierarchical, which means the numa statistics for

                                                             NUMA

> +a given group represents not only the workload of this group, but also the

                 represent

> +workloads of all it's descendants.

                    its

> +
> +For example the 'cpu.numa_stat' show::
> +
> +  locality 39541 60962 36842 72519 118605 721778 946553
> +  exectime 1220127 1458684
> +
> +The locality is sectioned into 7 regions, approximately as::
> +
> +  0-13% 14-27% 28-42% 43-56% 57-71% 72-85% 86-100%
> +
> +And exectime is sectioned into 2 nodes, 0 and 1 in this case.
> +
> +Thus we know the workload of this group and it's descendants have totally

                                               its

> +executed 1220127ms on node_0 and 1458684ms on node_1, tasks with locality
> +around 0~13% executed for 39541 ms, and tasks with locality around 87~100%
> +executed for 946553 ms, which imply most of the memory access are local.
> +
> +Monitoring
> +----------
> +
> +By monitoring the increments of these statistics, we can easily know whether
> +NUMA balancing is working well for a particular workload.
> +
> +For example we take a 5 secs sample period, and consider locality under 27%

                           seconds

> +is bad, then on each sampling we have::
> +
> +  region_bad = region_1 + region_2
> +  region_all = region_1 + region_2 + ... + region_7
> +
> +and we have the increments as::
> +
> +  region_bad_diff = region_bad - last_region_bad
> +  region_all_diff = region_all - last_region_all
> +
> +which finally become::
> +
> +  region_bad_percent = region_bad_diff * 100 / region_all_diff
> +
> +we can plot a line for region_bad_percent, when the line close to 0 things

   We

> +are good, when getting close to 100% something is wrong, we can pick a proper
> +watermark to trigger warning message.
> +
> +You may want to drop the data if the region_all is too small, which implies
> +there are not many available pages for NUMA Balancing, ignoring would be fine
> +since most likely the workload is insensitive to NUMA.
> +
> +Monitoring root group helps you control the overall situation, while you may
> +also want to monitor all the leaf groups which contain the workloads, this
> +helps to catch the mouse.
> +
> +The exectime could be useful when NUMA Balancing is disabled, or when locality
> +becomes too small, for NUMA node X we have::

               small. For

> +
> +  exectime_X_diff = exectime_X - last_exectime_X
> +  exectime_all_diff = exectime_all - last_exectime_all
> +
> +try to put your workload into a memory cgroup which providing per-node memory

   Try                                                 provides


> +consumption by 'memory.numa_stat' entry, then we could get::
> +
> +  memory_percent_X = memory_X * 100 / memory_all
> +  exectime_percent_X = exectime_X_diff * 100 / exectime_all_diff
> +
> +These two percentages are usually matched on each node, workload should execute
> +mostly on the node contain most of it's memory, but it's not guaranteed.

                 node that contains most of its

> +
> +The workload may only access a small part of it's memory, in such cases although

                                                its

> +the majority of memory are remotely, locality could still be good.
> +
> +Thus to tell if things are fine or not depends on the understanding of system
> +resource deployment, however, if you find node X got 100% memory percent but 0%
> +exectime percent, definitely something is wrong.
> +
> +Troubleshooting
> +---------------
> +
> +After identifying which workload introduced the bad locality, check:
> +
> +1). Is the workload bound to a particular NUMA node?
> +2). Has any NUMA node run out of resources?
> +
> +There are several ways to bind task's memory with a NUMA node, the strict way
> +like the MPOL_BIND memory policy or 'cpuset.mems' will limiting the memory

                                                     will limit

> +node where to allocate pages, in this situation, admin should make sure the

                          pages. In

> +task is allowed to run on the CPUs of that NUMA node, and make sure there are
> +available CPU resource there.
> +
> +There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
> +sched_setaffinity() syscall, in this situation, NUMA Balancing help to migrate

                       syscall. In

> +pages into that node, admin should make sure there are available memory there.
> +
> +Admin could try rebind or unbind the NUMA node to erase the damage, make a

               try to

> +change then observe the statistics see if things get better until the situation

               observe the statistics to see if

> +is acceptable.
> +
> +Highlights
> +----------
> +
> +For some tasks, NUMA Balancing may found no necessary to scan pages, and
> +locality could always be 0 or small number, don't pay attention to them
> +since they most likely insensitive to NUMA.
> +
> +There are no accounting until the option turned on, so enable it in advance

         is no accounting until the option is turned on,

> +if you want to have the whole history.
> +
> +We have per-task migfailed counter to tell how many page migration has been

I can't find any occurrence of 'migfailed' in the entire kernel source tree.
Maybe it is misspelled?

> +failed for a particular task, you will find it in /proc/PID/sched entry.


HTH.

-- 
~Randy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat
  2019-11-27  4:58     ` Randy Dunlap
@ 2019-11-27  5:54       ` 王贇
  0 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-11-27  5:54 UTC (permalink / raw)
  To: Randy Dunlap, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Hi, Randy

On 2019/11/27 下午12:58, Randy Dunlap wrote:
> On 11/26/19 5:50 PM, 王贇 wrote:
>> Since v1:
>>   * thanks to Iurii for the better sentence
>>   * thanks to Jonathan for the better format
>>
>> Add the description for 'cg_numa_stat', also a new doc to explain
>> the details on how to deal with the per-cgroup numa statistics.
>>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Michal Koutný <mkoutny@suse.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Jonathan Corbet <corbet@lwn.net>
>> Cc: Iurii Zaikin <yzaikin@google.com>
>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> 
> Hi,
> I have a few comments/corrections. Please see below.

Thanks for the comments :-) will apply them all in next version.

> 
>> ---
>>  Documentation/admin-guide/cg-numa-stat.rst      | 163 
[snip]
>> +if you want to have the whole history.
>> +
>> +We have per-task migfailed counter to tell how many page migration has been
> 
> I can't find any occurrence of 'migfailed' in the entire kernel source tree.
> Maybe it is misspelled?

This one is added by the secondary patch:
  [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure

As suggested by Mel.

Regards,
Michael Wang

> 
>> +failed for a particular task, you will find it in /proc/PID/sched entry.
> 
> 
> HTH.
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure
  2019-11-27  1:50   ` [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure 王贇
@ 2019-11-27 10:00     ` Mel Gorman
  2019-12-02  2:22     ` 王贇
  1 sibling, 0 replies; 45+ messages in thread
From: Mel Gorman @ 2019-11-27 10:00 UTC (permalink / raw)
  To: ??????
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

On Wed, Nov 27, 2019 at 09:50:01AM +0800, ?????? wrote:
> NUMA balancing will try to migrate pages between nodes, which
> could caused by memory policy or numa group aggregation, while
> the page migration could failed too for eg when the target node
> run out of memory.
> 
> Since this is critical to the performance, admin should know
> how serious the problem is, and take actions before it causing
> too much performance damage, thus this patch expose the counter
> as 'migfailed' in '/proc/PID/sched'.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Suggested-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>

This patch can be treated independently of the rest of the series as
it's not directly related.

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-27  1:49   ` [PATCH v2 1/3] sched/numa: advanced per-cgroup " 王贇
@ 2019-11-27 10:19     ` Mel Gorman
  2019-11-28  2:09       ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Mel Gorman @ 2019-11-27 10:19 UTC (permalink / raw)
  To: ??????
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

On Wed, Nov 27, 2019 at 09:49:34AM +0800, ?????? wrote:
> Currently there are no good approach to monitoring the per-cgroup
> numa efficiency, this could be a trouble especially when groups
> are sharing CPUs, it's impossible to tell which one caused the
> remote-memory access by reading hardware counter since multiple
> workloads could sharing the same CPU, which make it painful when
> one want to find out the root cause and fix the issue.
> 

It's already possible to identify specific tasks triggering PMU events
so this is not exactly true.

> In order to address this, we introduced new per-cgroup statistic
> for numa:
>   * the numa locality to imply the numa balancing efficiency
>   * the numa execution time on each node
> 
> The task locality is the local page accessing ratio traced on numa
> balancing PF, and the group locality is the topology of task execution
> time, sectioned by the locality into 7 regions.
> 
> For example the new entry 'cpu.numa_stat' show:
>   locality 39541 60962 36842 72519 118605 721778 946553
>   exectime 1220127 1458684
> 
> Here we know the workloads in hierarchy executed 1220127ms on node_0
> and 1458684ms on node_1 in total, tasks with locality around 0~13%
> executed for 39541 ms, and tasks with locality around 86~100% executed
> for 946553 ms, which imply most of the memory access are local access.
> 
> By monitoring the new statistic, we will be able to know the numa
> efficiency of each per-cgroup workloads on machine, whatever they
> sharing the CPUs or not, we will be able to find out which one
> introduced the remote access mostly.
> 
> Besides, per-node memory topology from 'memory.numa_stat' become
> more useful when we have the per-node execution time, workloads
> always executing on node_0 while it's memory is all on node_1 is
> usually a bad case.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  include/linux/sched.h        | 18 ++++++++-
>  include/linux/sched/sysctl.h |  6 +++
>  init/Kconfig                 |  9 +++++
>  kernel/sched/core.c          | 91 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c          | 33 ++++++++++++++++
>  kernel/sched/sched.h         | 17 +++++++++
>  kernel/sysctl.c              | 11 ++++++
>  7 files changed, 184 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8f6607cd40ac..505b041594ef 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1118,9 +1118,25 @@ struct task_struct {
>  	 * numa_faults_locality tracks if faults recorded during the last
>  	 * scan window were remote/local or failed to migrate. The task scan
>  	 * period is adapted based on the locality of the faults with different
> -	 * weights depending on whether they were shared or private faults
> +	 * weights depending on whether they were shared or private faults.
> +	 *
> +	 * Counter id stand for:
> +	 * 0 -- remote faults
> +	 * 1 -- local faults
> +	 * 2 -- page migration failure
> +	 *
> +	 * Extra counters when CONFIG_CGROUP_NUMA_STAT enabled:
> +	 * 3 -- remote page accessing
> +	 * 4 -- local page accessing
> +	 *
> +	 * The 'remote/local faults' records the cpu-page relationship before
> +	 * page migration, while the 'remote/local page accessing' is after.
>  	 */
> +#ifndef CONFIG_CGROUP_NUMA_STAT
>  	unsigned long			numa_faults_locality[3];
> +#else
> +	unsigned long			numa_faults_locality[5];
> +#endif
> 
>  	unsigned long			numa_pages_migrated;
>  #endif /* CONFIG_NUMA_BALANCING */
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 89f55e914673..2d6a515df544 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -102,4 +102,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
>  				 loff_t *ppos);
>  #endif
> 
> +#ifdef CONFIG_CGROUP_NUMA_STAT
> +extern int sysctl_cg_numa_stat(struct ctl_table *table, int write,
> +				 void __user *buffer, size_t *lenp,
> +				 loff_t *ppos);
> +#endif
> +
>  #endif /* _LINUX_SCHED_SYSCTL_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 4d8d145c41d2..b31d2b560493 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>  	  machine.
> 
> +config CGROUP_NUMA_STAT
> +	bool "Advanced per-cgroup NUMA statistics"
> +	default n
> +	depends on CGROUP_SCHED && NUMA_BALANCING
> +	help
> +	  This option adds support for per-cgroup NUMA locality/execution
> +	  statistics, for monitoring NUMA efficiency of per-cgroup workloads
> +	  on NUMA platforms with NUMA Balancing enabled.
> +
>  menuconfig CGROUPS
>  	bool "Control Group support"
>  	select KERNFS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index aaa1740e6497..eabcab25be50 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7657,6 +7657,84 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
> 
> +#ifdef CONFIG_CGROUP_NUMA_STAT
> +DEFINE_STATIC_KEY_FALSE(sched_cg_numa_stat);
> +
> +#ifdef CONFIG_PROC_SYSCTL
> +int sysctl_cg_numa_stat(struct ctl_table *table, int write,
> +			 void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct ctl_table t;
> +	int err;
> +	int state = static_branch_likely(&sched_cg_numa_stat);
> +
> +	if (write && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	t = *table;
> +	t.data = &state;
> +	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
> +	if (err < 0 || !write)
> +		return err;
> +
> +	if (state)
> +		static_branch_enable(&sched_cg_numa_stat);
> +	else
> +		static_branch_disable(&sched_cg_numa_stat);
> +
> +	return err;
> +}
> +#endif
> +

Why is this implemented as a toggle? I'm finding it hard to make sense
of this. The numa_stat should not even exist if the feature is disabled.

Assuming that is fixed then the runtime overhead is fine but the same
issues with the quality of the information relying on NUMA balancing
limits the usefulness of this. Disabling NUMA balancing or the scan rate
dropping to a very low frequency would lead in misleading conclusions as
well as false positives if the CPU and memory policies force remote memory
usage. Similarly, the timing of the information available is variable du
to how numa_faults_locality gets reset so sometimes the information is
fine-grained and sometimes it's coarse grained. It will also pretend to
display useful information even if NUMA balancing is disabled.

I find it hard to believe it would be useful in practice and I think users
would have real trouble interpreting the data given how much it relies on
internal implementation details of NUMA balancing. I cannot be certain
as clearly something motivated the creation of this patch although it's
unclear if it has ever been used to debug and fix an actual problem in
the field. Hence, I'm neutral on the patch and will neither ack or nack
it and will defer to the scheduler maintainers but if I was pushed on it,
I would be disinclined to merge the patch due to the potential confusion
caused by users who believe it provides accurate information when at best
it gives a rough approximation with variable granularity.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-27 10:19     ` Mel Gorman
@ 2019-11-28  2:09       ` 王贇
  2019-11-28 12:39         ` Michal Koutný
  0 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-11-28  2:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

On 2019/11/27 下午6:19, Mel Gorman wrote:
> On Wed, Nov 27, 2019 at 09:49:34AM +0800, ?????? wrote:
>> Currently there are no good approach to monitoring the per-cgroup
>> numa efficiency, this could be a trouble especially when groups
>> are sharing CPUs, it's impossible to tell which one caused the
>> remote-memory access by reading hardware counter since multiple
>> workloads could sharing the same CPU, which make it painful when
>> one want to find out the root cause and fix the issue>>
> 
> It's already possible to identify specific tasks triggering PMU events
> so this is not exactly true.

Should fix the description regarding this...

I think you mean tools like numatop which showing per task local/remote
accessing info from PMU, correct?

It's a good one for debugging, but when we talking about monitoring over
cluster sharing by multiple users, still not very practical... compared
to the workloads classified historical data.

I'm not sure about the overhead and limitation of this PMU approach, or
whether there are any platform it's not yet supported, worth a survey.

> 
>> In order to address this, we introduced new per-cgroup statistic
>> for numa:
>>   * the numa locality to imply the numa balancing efficiency
>>   * the numa execution time on each node
>>
[snip]
>> +#ifdef CONFIG_PROC_SYSCTL
>> +int sysctl_cg_numa_stat(struct ctl_table *table, int write,
>> +			 void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	struct ctl_table t;
>> +	int err;
>> +	int state = static_branch_likely(&sched_cg_numa_stat);
>> +
>> +	if (write && !capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	t = *table;
>> +	t.data = &state;
>> +	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
>> +	if (err < 0 || !write)
>> +		return err;
>> +
>> +	if (state)
>> +		static_branch_enable(&sched_cg_numa_stat);
>> +	else
>> +		static_branch_disable(&sched_cg_numa_stat);
>> +
>> +	return err;
>> +}
>> +#endif
>> +
> 
> Why is this implemented as a toggle? I'm finding it hard to make sense
> of this. The numa_stat should not even exist if the feature is disabled.

numa_stat will not exist if CONFIG is not enabled, do you mean it should
also disappear when dynamically turn off?

> 
> Assuming that is fixed then the runtime overhead is fine but the same
> issues with the quality of the information relying on NUMA balancing
> limits the usefulness of this. Disabling NUMA balancing or the scan rate
> dropping to a very low frequency would lead in misleading conclusions as
> well as false positives if the CPU and memory policies force remote memory
> usage. Similarly, the timing of the information available is variable du
> to how numa_faults_locality gets reset so sometimes the information is
> fine-grained and sometimes it's coarse grained. It will also pretend to
> display useful information even if NUMA balancing is disabled.

The data just represent what we traced on NUMA balancing PF, so yes folks
need some understanding on NUMA balancing to figure out the real meaning
behind locality.

We want it to tell the real story, if NUMA balancing disabled or scan rate
dropped very low, the locality increments should be very small, when it
keep failing for memory policy or CPU binding reason, we want it to tell how
bad it is, locality just show us how NUMA Balancing is performing, the data
could contains many information since how OS dealing with NUMA could be
complicated...

> 
> I find it hard to believe it would be useful in practice and I think users
> would have real trouble interpreting the data given how much it relies on
> internal implementation details of NUMA balancing. I cannot be certain
> as clearly something motivated the creation of this patch although it's
> unclear if it has ever been used to debug and fix an actual problem in
> the field. Hence, I'm neutral on the patch and will neither ack or nack
> it and will defer to the scheduler maintainers but if I was pushed on it,
> I would be disinclined to merge the patch due to the potential confusion
> caused by users who believe it provides accurate information when at best
> it gives a rough approximation with variable granularity.

We have our cluster enabled this feature already, an old version though but
still helpful, when we want to debug NUMA issues, this could give good hints.

Consider it as load_1/5/15 which not accurate but tell the trend of system
behavior, locality giving the trend of NUMA Balancing as long as it's working,
when increasing slowly it means locality already good enough or no more memory
to adjust, and that's fine, for those who disabled the NUMA Balancing, they do
their own NUMA optimization and find their own ways to estimate the results.

Anyway, we thanks for all those good inputs from your side :-)

Regards,
Michael Wang


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-28  2:09       ` 王贇
@ 2019-11-28 12:39         ` Michal Koutný
  2019-11-28 13:41           ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Koutný @ 2019-11-28 12:39 UTC (permalink / raw)
  To: 王贇
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-fsdevel,
	linux-kernel, linux-doc, Paul E. McKenney

[-- Attachment #1: Type: text/plain, Size: 1622 bytes --]

Hello.

My primary concern is still the measuring of per-NUMA node execution
time.

First, I think exposing the aggregated data into the numa_stat file is
loss of information. The data are collected per-CPU and then summed over
NUMA nodes -- this could be easily done by the userspace consumer of the
data, keeping the per-CPU data available.

Second, comparing with the cpuacct implementation, yours has only jiffy
granularity (I may have overlooked something or I miss some context,
then it's a non-concern).

IOW, to me it sounds like duplicating cpuacct job and if that is deemed
useful for cgroup v2, I think it should be done (only once) and at
proper place (i.e. how cputime is measured in the default hierarchy).

The previous two are design/theoretical remarks, however, your patch
misses measuring of other than fair_sched_class policy tasks. Is that
intentional?

My last two comments are to locality measurement but are based on no
experience or specific knowledge.

The seven percentile groups seem quite arbitrary to me, I find it
strange that the ratio of cache-line size and u64 leaks and is fixed in
the generally visible file. Wouldn't such a form be better hidden under
a _DEBUG config option?


On Thu, Nov 28, 2019 at 10:09:13AM +0800, 王贇 <yun.wang@linux.alibaba.com> wrote:
> Consider it as load_1/5/15 which not accurate but tell the trend of system
I understood your patchset provides cumulative data over time, i.e. if
a user wants to see an immediate trend, they have to calculate
differences. Have I overlooked some back-off or regular zeroing?

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-28 12:39         ` Michal Koutný
@ 2019-11-28 13:41           ` 王贇
  2019-11-28 15:58             ` Michal Koutný
  0 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-11-28 13:41 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-fsdevel,
	linux-kernel, linux-doc, Paul E. McKenney

On 2019/11/28 下午8:39, Michal Koutný wrote:
> Hello.
> 
> My primary concern is still the measuring of per-NUMA node execution
> time.
> 
> First, I think exposing the aggregated data into the numa_stat file is
> loss of information. The data are collected per-CPU and then summed over
> NUMA nodes -- this could be easily done by the userspace consumer of the
> data, keeping the per-CPU data available.
> 
> Second, comparing with the cpuacct implementation, yours has only jiffy
> granularity (I may have overlooked something or I miss some context,
> then it's a non-concern).

There are used to be a discussion on this, Peter mentioned we no longer
expose raw ticks into userspace and micro seconds could be fine.

Basically we use this to calculate percentages, for which jiffy could be
accurate enough :-)

> 
> IOW, to me it sounds like duplicating cpuacct job and if that is deemed
> useful for cgroup v2, I think it should be done (only once) and at
> proper place (i.e. how cputime is measured in the default hierarchy).

But still, what if folks don't use v2... any good suggestions?

> 
> The previous two are design/theoretical remarks, however, your patch
> misses measuring of other than fair_sched_class policy tasks. Is that
> intentional?

Yes, since they don't have NUMA balancing to do optimization, and
generally they are not that much.

> 
> My last two comments are to locality measurement but are based on no
> experience or specific knowledge.
> 
> The seven percentile groups seem quite arbitrary to me, I find it
> strange that the ratio of cache-line size and u64 leaks and is fixed in
> the generally visible file. Wouldn't such a form be better hidden under
> a _DEBUG config option?

Sorry but I don't get it... at first it was 10 regions, as Peter suggested
we pick 8, but now to insert member 'jiffies' it become 7, the address of
'jiffies' is cache aligned, so we pick u64 * 8 == 64Bytes to make sure the
whole thing could be load in cache once a time, or did I misunderstand
something?

> 
> 
> On Thu, Nov 28, 2019 at 10:09:13AM +0800, 王贇 <yun.wang@linux.alibaba.com> wrote:
>> Consider it as load_1/5/15 which not accurate but tell the trend of system
> I understood your patchset provides cumulative data over time, i.e. if
> a user wants to see an immediate trend, they have to calculate
> differences. Have I overlooked some back-off or regular zeroing?

Yes, here what I try to highlight is the similar usage, but not the way of
monitoring ;-) as the docs tell, we monitoring increments.

Regards,
Michale Wang

> 
> Michal
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-28 13:41           ` 王贇
@ 2019-11-28 15:58             ` Michal Koutný
  2019-11-29  1:52               ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Koutný @ 2019-11-28 15:58 UTC (permalink / raw)
  To: 王贇
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-fsdevel,
	linux-kernel, linux-doc, Paul E. McKenney

[-- Attachment #1: Type: text/plain, Size: 2085 bytes --]

On Thu, Nov 28, 2019 at 09:41:37PM +0800, 王贇 <yun.wang@linux.alibaba.com> wrote:
> There are used to be a discussion on this, Peter mentioned we no longer
> expose raw ticks into userspace and micro seconds could be fine.
I don't mean the unit presented but the precision.

> Basically we use this to calculate percentages, for which jiffy could be
> accurate enough :-)
You also report the raw times.

Ad percentages (or raw times precision), on average, it should be fine
but can't there be any "aliasing" artifacts when only an unimportant
task is regularly sampled, hence not capturing the real pattern on the
CPU? (Again, I'm not confident I'm not missing anything that prevents
that behavior.)

> But still, what if folks don't use v2... any good suggestions?
(Note this applies to exectimes not locality.) On v1, they can add up
per CPU values from cpuacct. (So it's v2 that's missing the records.)


> Yes, since they don't have NUMA balancing to do optimization, and
> generally they are not that much.
Aha, I didn't realize that.

> Sorry but I don't get it... at first it was 10 regions, as Peter suggested
> we pick 8, but now to insert member 'jiffies' it become 7,
See, there are various arguments for different values :-)

I meant that the currently chosen one is imprinted into the API file.
That is IMO fixable by documenting (e.g. the number of bands may change,
assume uniform division) or making all this just a debug API. Or, see
below.

> Yes, here what I try to highlight is the similar usage, but not the way of
> monitoring ;-) as the docs tell, we monitoring increments.
I see, the docs give me an idea what's the supposed use case.

What about exposing only the counters for local, remote and let the user
do their monitoring based on Δlocal/(Δlocal + Δremote)?

That would avoid the partitioning question completely, exposed values
would be simple numbers and provided information should be equal. A
drawback is that such a sampling would be slower (but sufficient for the
illustrating example).

Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-28 15:58             ` Michal Koutný
@ 2019-11-29  1:52               ` 王贇
  2019-11-29  5:19                 ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-11-29  1:52 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-fsdevel,
	linux-kernel, linux-doc, Paul E. McKenney



On 2019/11/28 下午11:58, Michal Koutný wrote:
> On Thu, Nov 28, 2019 at 09:41:37PM +0800, 王贇 <yun.wang@linux.alibaba.com> wrote:
>> There are used to be a discussion on this, Peter mentioned we no longer
>> expose raw ticks into userspace and micro seconds could be fine.
> I don't mean the unit presented but the precision.
> 
>> Basically we use this to calculate percentages, for which jiffy could be
>> accurate enough :-)
> You also report the raw times.>
> Ad percentages (or raw times precision), on average, it should be fine
> but can't there be any "aliasing" artifacts when only an unimportant
> task is regularly sampled, hence not capturing the real pattern on the
> CPU? (Again, I'm not confident I'm not missing anything that prevents
> that behavior.)

Hmm.. I think I get your point now, so the concern is about the missing
situation between each ticks, correct?

It could be, like one tick hit task A running, then A switched to B, B
switched back to A before next tick, then we missing the exectime of B
in next tick, since it hit A again.

Actually we have the same issue for those data in /proc/stat too, don't
we? The user, sys, iowait was sampled in the similar way.

So if we have to pick a precision, I may still pick jiffy since the
exectime is some thing similar to user/sys time IMHO.

> 
>> But still, what if folks don't use v2... any good suggestions?
> (Note this applies to exectimes not locality.) On v1, they can add up
> per CPU values from cpuacct. (So it's v2 that's missing the records.)

Whatabout move the whole stuff into cpuacct cgroup?

I'm not sure but maybe we could use some data there to save the sample
of jiffies, for those v1 user who need these statistics, they should have
cpuacct enabled.

> 
> 
>> Yes, since they don't have NUMA balancing to do optimization, and
>> generally they are not that much.
> Aha, I didn't realize that.
> 
>> Sorry but I don't get it... at first it was 10 regions, as Peter suggested
>> we pick 8, but now to insert member 'jiffies' it become 7,
> See, there are various arguments for different values :-)
> 
> I meant that the currently chosen one is imprinted into the API file.
> That is IMO fixable by documenting (e.g. the number of bands may change,
> assume uniform division) or making all this just a debug API. Or, see
> below.
> 
>> Yes, here what I try to highlight is the similar usage, but not the way of
>> monitoring ;-) as the docs tell, we monitoring increments.
> I see, the docs give me an idea what's the supposed use case.
> 
> What about exposing only the counters for local, remote and let the user
> do their monitoring based on Δlocal/(Δlocal + Δremote)?
> 
> That would avoid the partitioning question completely, exposed values
> would be simple numbers and provided information should be equal. A
> drawback is that such a sampling would be slower (but sufficient for the
> illustrating example).

You mean the cgroup numa stat just give the accumulated local/remote access?

As long as the counter won't overflow, maybe... sounds easier to explain too.

So user tracing locality will then get just one percentage (calculated on
their own) from a cgroup, but one should be enough to represent the situation.

Sounds like a good idea to me :-) will try to do that in next version.

Regards,
Michael Wang

> 
> Michal
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-29  1:52               ` 王贇
@ 2019-11-29  5:19                 ` 王贇
  2019-11-29 10:06                   ` Michal Koutný
  0 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-11-29  5:19 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-fsdevel,
	linux-kernel, linux-doc, Paul E. McKenney



On 2019/11/29 上午9:52, 王贇 wrote:
[snip]
>> That would avoid the partitioning question completely, exposed values
>> would be simple numbers and provided information should be equal. A
>> drawback is that such a sampling would be slower (but sufficient for the
>> illustrating example).
> 
> You mean the cgroup numa stat just give the accumulated local/remote access?
> 
> As long as the counter won't overflow, maybe... sounds easier to explain too.
> 
> So user tracing locality will then get just one percentage (calculated on
> their own) from a cgroup, but one should be enough to represent the situation.
> 
> Sounds like a good idea to me :-) will try to do that in next version.

I did some research regarding cpuacct, and find cpuacct_charge() is a good
place to do hierarchical update, however, what we get there is the execution
time delta since last update_curr().

I'm afraid we can't just do local/remote accumulation since the sample period
now is changing, still have to accumulate the execution time into locality
regions.

While at least we should be able to address your concern regarding exectime
collection :-)

Regards,
Michael Wang

> 
> Regards,
> Michael Wang
> 
>>
>> Michal
>>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-29  5:19                 ` 王贇
@ 2019-11-29 10:06                   ` Michal Koutný
  2019-12-02  2:11                     ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Michal Koutný @ 2019-11-29 10:06 UTC (permalink / raw)
  To: 王贇
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-fsdevel,
	linux-kernel, linux-doc, Paul E. McKenney

[-- Attachment #1: Type: text/plain, Size: 830 bytes --]

On Fri, Nov 29, 2019 at 01:19:33PM +0800, 王贇 <yun.wang@linux.alibaba.com> wrote:
> I did some research regarding cpuacct, and find cpuacct_charge() is a good
> place to do hierarchical update, however, what we get there is the execution
> time delta since last update_curr().
I wouldn't extend cpuacct, I'd like to look into using the rstat
mechanism for per-CPU runtime collection. (Most certainly I won't get
down to this until mid December though.)

> I'm afraid we can't just do local/remote accumulation since the sample period
> now is changing, still have to accumulate the execution time into locality
> regions.
My idea was to decouple time from the locality counters completely. It'd
be up to the monitoring application to normalize differences wrt
sampling rate (and handle wrap arounds).


Michal

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 1/3] sched/numa: advanced per-cgroup numa statistic
  2019-11-29 10:06                   ` Michal Koutný
@ 2019-12-02  2:11                     ` 王贇
  0 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-12-02  2:11 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, linux-fsdevel,
	linux-kernel, linux-doc, Paul E. McKenney



On 2019/11/29 下午6:06, Michal Koutný wrote:
> On Fri, Nov 29, 2019 at 01:19:33PM +0800, 王贇 <yun.wang@linux.alibaba.com> wrote:
>> I did some research regarding cpuacct, and find cpuacct_charge() is a good
>> place to do hierarchical update, however, what we get there is the execution
>> time delta since last update_curr().
> I wouldn't extend cpuacct, I'd like to look into using the rstat
> mechanism for per-CPU runtime collection. (Most certainly I won't get
> down to this until mid December though.)
> 
>> I'm afraid we can't just do local/remote accumulation since the sample period
>> now is changing, still have to accumulate the execution time into locality
>> regions.y
> My idea was to decouple time from the locality counters completely. It'd
> be up to the monitoring application to normalize differences wrt
> sampling rate (and handle wrap arounds).

I see, basically I understand your proposal as utilize cpuacct's runtime
and only expose per-cgroup local/remote counters, I'm not sure if the
locality still helpful after decouple time factor from it, both need some
investigation, anyway, once I could convince myself it's working, I'll
be happy to make things simple ;-)

Regards,
Michael Wang

> 
> 
> Michal
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure
  2019-11-27  1:50   ` [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure 王贇
  2019-11-27 10:00     ` Mel Gorman
@ 2019-12-02  2:22     ` 王贇
  1 sibling, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-12-02  2:22 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney

Hi, Peter,

This has been acked by Mel Gorman, since it is not much related
with the rest patches, would you like to pick this one now?

Regards,
Michael Wang

On 2019/11/27 上午9:50, 王贇 wrote:
> NUMA balancing will try to migrate pages between nodes, which
> could caused by memory policy or numa group aggregation, while
> the page migration could failed too for eg when the target node
> run out of memory.
> 
> Since this is critical to the performance, admin should know
> how serious the problem is, and take actions before it causing
> too much performance damage, thus this patch expose the counter
> as 'migfailed' in '/proc/PID/sched'.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Suggested-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  kernel/sched/debug.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index f7e4579e746c..73c4809c8f37 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -848,6 +848,7 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
>  	P(total_numa_faults);
>  	SEQ_printf(m, "current_node=%d, numa_group_id=%d\n",
>  			task_node(p), task_numa_group_id(p));
> +	SEQ_printf(m, "migfailed=%lu\n", p->numa_faults_locality[2]);
>  	show_numa_stats(p, m);
>  	mpol_put(pol);
>  #endif
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 0/2] sched/numa: introduce numa locality
  2019-11-27  1:48 ` [PATCH v2 " 王贇
                     ` (2 preceding siblings ...)
  2019-11-27  1:50   ` [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
@ 2019-12-03  5:59   ` 王贇
  2019-12-03  6:00     ` [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
                       ` (2 more replies)
  3 siblings, 3 replies; 45+ messages in thread
From: 王贇 @ 2019-12-03  5:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Since last patch set:
  Because the locality region concept is too complicated, we tried to
  decouple the time factor from it now as Michal commented.

  Now the numa_stat just expose the local/remote page accessing counter
  , gathering from all the tasks in hierarchy. This should be much more
  easier to understand, also the meaning of counter is straightforward.

  Now we have just one locality percentage for each cgroup, to represent
  how NUMA Balancing is working and imply NUMA efficiency.

Modern production environment could use hundreds of cgroup to control
the resources for different workloads, along with the complicated
resource binding.

On NUMA platforms where we have multiple nodes, things become even more
complicated, we hope there are more local memory access to improve the
performance, and NUMA Balancing keep working hard to achieve that,
however, wrong memory policy or node binding could easily waste the
effort, result a lot of remote page accessing.

We need to notice such problems, then we got chance to fix it before
there are too much damages, however, there are no good monitoring
approach yet to help catch the mouse who introduced the remote access.

This patch set is trying to fill in the missing pieces, by introduce
the per-cgroup NUMA locality info, with this new statistics, we could
achieve the daily monitoring on NUMA efficiency, to give warning when
things going too wrong.

Please check the second patch for more details.

Michael Wang (2):
  sched/numa: introduce per-cgroup NUMA locality info
  sched/numa: documentation for per-cgroup numa statistics

 Documentation/admin-guide/cg-numa-stat.rst      | 176 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 include/linux/sched.h                           |  15 ++
 include/linux/sched/sysctl.h                    |   6 +
 init/Kconfig                                    |  11 ++
 kernel/sched/core.c                             |  75 ++++++++++
 kernel/sched/fair.c                             |  62 +++++++++
 kernel/sched/sched.h                            |  12 ++
 kernel/sysctl.c                                 |  11 ++
 11 files changed, 382 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-03  5:59   ` [PATCH v3 0/2] sched/numa: introduce numa locality 王贇
@ 2019-12-03  6:00     ` 王贇
  2019-12-04  2:33       ` Randy Dunlap
  2019-12-03  6:02     ` [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
  2019-12-04  7:58     ` [PATCH v4 0/2] sched/numa: introduce numa locality 王贇
  2 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-12-03  6:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Currently there are no good approach to monitoring the per-cgroup NUMA
efficiency, this could be a trouble especially when groups are sharing
CPUs, we don't know which one introduced remote-memory accessing.

Although the per-task NUMA accessing info from PMU is good for further
debuging, but not light enough for daily monitoring, especial on a box
with thousands of tasks.

Fortunately, when NUMA Balancing enabled, it will periodly trigger page
fault and try to increase the NUMA locality, by tracing the results we
will be able to estimate the NUMA efficiency.

On each page fault of NUMA Balancing, when task's executing CPU is from
the same node of pages, we call this a local page accessing, otherwise
a remote page accessing.

By updating task's accessing counter into it's cgroup on ticks, we get
the per-cgroup numa locality info.

For example the new entry 'cpu.numa_stat' show:
  page_access local=1231412 remote=53453

Here we know the workloads in hierarchy have totally been traced 1284865
times of page accessing, and 1231412 of them are local page access, which
imply a good NUMA efficiency.

By monitoring the increments, we will be able to locate the per-cgroup
workload which NUMA Balancing can't helpwith (usually caused by wrong
CPU and memory node bindings), then we got chance to fix that in time.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/sched.h        | 15 +++++++++
 include/linux/sched/sysctl.h |  6 ++++
 init/Kconfig                 |  9 ++++++
 kernel/sched/core.c          | 75 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         | 12 +++++++
 kernel/sysctl.c              | 11 +++++++
 7 files changed, 190 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f6607cd40ac..d15704ac0c6e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1125,6 +1125,21 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * Counter index stand for:
+	 * 0 -- remote page accessing
+	 * 1 -- local page accessing
+	 * 2 -- remote page accessing updated
+	 * 3 -- local page accessing updated
+	 *
+	 * We record the counter in task_numa_fault(), this is based on the
+	 * fact that after page fault is handled, the task will access the
+	 * page on the CPU where it triggered PF.
+	 */
+	unsigned long			numa_page_access[4];
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_sig;
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 89f55e914673..c7048119b8b5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -102,4 +102,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
 				 loff_t *ppos);
 #endif

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern int sysctl_numa_locality(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 4d8d145c41d2..9c086f716a6d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.

+config CGROUP_NUMA_LOCALITY
+	bool "The per-cgroup NUMA Locality"
+	default n
+	depends on CGROUP_SCHED && NUMA_BALANCING
+	help
+	  This option enable the collection of per-cgroup NUMA locality info,
+	  to tell whether NUMA Balancing is working well for a particular
+	  workload.
+
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aaa1740e6497..6a7850d94c55 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7657,6 +7657,68 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+DEFINE_STATIC_KEY_FALSE(sched_numa_locality);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_numa_locality(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_numa_locality);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0 || !write)
+		return err;
+
+	if (state)
+		static_branch_enable(&sched_numa_locality);
+	else
+		static_branch_disable(&sched_numa_locality);
+
+	return err;
+}
+#endif
+
+static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
+{
+	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
+}
+
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int cpu;
+	u64 local = 0, remote = 0;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	if (!static_branch_likely(&sched_numa_locality))
+		return 0;
+
+	for_each_possible_cpu(cpu) {
+		local += tg_cfs_rq(tg, cpu)->local_page_access;
+		remote += tg_cfs_rq(tg, cpu)->remote_page_access;
+	}
+
+	seq_printf(sf, "page_access local=%llu remote=%llu\n", local, remote);
+
+	return 0;
+}
+
+static __init int numa_locality_setup(char *opt)
+{
+	static_branch_enable(&sched_numa_locality);
+
+	return 0;
+}
+__setup("numa_locality", numa_locality_setup);
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7706,6 +7768,12 @@ static struct cftype cpu_legacy_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7887,6 +7955,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81eba554db8d..4f5689f5a088 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1050,6 +1050,62 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  */

 #ifdef CONFIG_NUMA_BALANCING
+
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+/*
+ * We want to record the real local/remote page access statistic
+ * here, so 'pnid' should be pages's real residential node after
+ * migrate_misplaced_page(), and 'cnid' should be the node of CPU
+ * where triggered the PF.
+ */
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	/*
+	 * pnid != cnid --> remote idx 0
+	 * pnid == cnid --> local idx 1
+	 */
+	p->numa_page_access[!!(pnid == cnid)] += pages;
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+	unsigned long ldiff, rdiff;
+
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	rdiff = current->numa_page_access[0] - current->numa_page_access[2];
+	ldiff = current->numa_page_access[1] - current->numa_page_access[3];
+	if (!ldiff && !rdiff)
+		return;
+
+	cfs_rq->local_page_access += ldiff;
+	cfs_rq->remote_page_access += rdiff;
+
+	/*
+	 * Consider updated when reach root cfs_rq, no PF should happen
+	 * during the hierarchical updating.
+	 */
+	if (&cfs_rq->rq->cfs == cfs_rq) {
+		current->numa_page_access[2] = current->numa_page_access[0];
+		current->numa_page_access[3] = current->numa_page_access[1];
+	}
+}
+#else
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+}
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
+
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
  * calculated based on the tasks virtual memory size and
@@ -2465,6 +2521,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+
+	update_task_locality(p, mem_node, numa_node_id(), pages);
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2650,6 +2708,9 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->last_sum_exec_runtime	= 0;

 	init_task_work(&p->numa_work, task_numa_work);
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	memset(p->numa_page_access, 0, sizeof(p->numa_page_access));
+#endif

 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -4298,6 +4359,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
+	update_group_locality(cfs_rq);

 #ifdef CONFIG_SCHED_HRTICK
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 05c282775f21..33f5653d9d4c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,14 @@ struct cfs_rq {
 	struct list_head	throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * The local/remote page access info collected from all
+	 * the tasks in hierarchy.
+	 */
+	u64			local_page_access;
+	u64			remote_page_access;
+#endif
 };

 static inline int rt_bandwidth_enabled(void)
@@ -1601,6 +1609,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern struct static_key_false sched_numa_locality;
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50373984a5e2..73cbb70940ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.procname	= "numa_locality",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_numa_locality,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics
  2019-12-03  5:59   ` [PATCH v3 0/2] sched/numa: introduce numa locality 王贇
  2019-12-03  6:00     ` [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
@ 2019-12-03  6:02     ` 王贇
  2019-12-03 13:43       ` Jonathan Corbet
  2019-12-04  7:58     ` [PATCH v4 0/2] sched/numa: introduce numa locality 王贇
  2 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-12-03  6:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Add the description for 'numa_locality', also a new doc to explain
the details on how to deal with the per-cgroup numa statistics.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 Documentation/admin-guide/cg-numa-stat.rst      | 176 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 include/linux/sched.h                           |  10 +-
 init/Kconfig                                    |   4 +-
 kernel/sched/fair.c                             |   4 +-
 7 files changed, 200 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
new file mode 100644
index 000000000000..49167db36f37
--- /dev/null
+++ b/Documentation/admin-guide/cg-numa-stat.rst
@@ -0,0 +1,176 @@
+===============================
+Per-cgroup NUMA statistics
+===============================
+
+Background
+----------
+
+On NUMA platforms, remote memory accessing always has a performance penalty.
+Although we have NUMA balancing working hard to maximize the access locality,
+there are still situations it can't help.
+
+This could happen in modern production environment. When a large number of
+cgroups are used to classify and control resources, this creates a complex
+configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
+balancing could end up with the wrong memory policy or exhausted local NUMA
+node, which would lead to low percentage of local page accesses.
+
+We need to detect such cases, figure out which workloads from which cgroup
+have introduced the issues, then we get chance to do adjustment to avoid
+performance degradation.
+
+However, there are no hardware counters for per-task local/remote accessing
+info, we don't know how many remote page accesses have occurred for a
+particular task.
+
+NUMA Locality
+-------------
+
+Fortunately, we have NUMA Balancing which scans task's mapping and triggers
+page fault periodically, giving us the opportunity to record per-task page
+accessing info, when the CPU fall into PF is from the same node of pages, we
+consider task as doing local page accessing, otherwise the remote page
+accessing, we call these two counter the locality info.
+
+On each tick, we acquire the locality info of current task on that CPU, update
+the increments into it's cgroup, becoming the group locality info.
+
+By "echo 1 > /proc/sys/kernel/numa_locality" at runtime or adding boot parameter
+'numa_locality', we will enable the accounting of per-cgroup NUMA locality info,
+the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
+
+  page_access local=NR_LOCAL_PAGE_ACCESS remote=NR_REMOTE_PAGE_ACCESS
+
+We define 'NUMA locality' as::
+
+  NR_LOCAL_PAGE_ACCESS * 100 / (NR_LOCAL_PAGE_ACCESS + NR_REMOTE_PAGE_ACCESS)
+
+This per-cgroup percentage number help to represent the NUMA Balancing behavior.
+
+Note that the accounting is hierarchical, which means the NUMA locality info for
+a given group represent not only the workload of this group, but also the
+workloads of all its descendants.
+
+For example the 'cpu.numa_stat' show::
+
+  page_access local=129909383 remote=18265810
+
+The NUMA locality calculated as::
+
+  129909383 * 100 / (129909383 + 18265810) = 87.67
+
+Thus we know the workload of this group and its descendants have totally done
+129909383 times of local page accessing and 18265810 times of remotes, locality
+is 87.67% which imply most of the memory access are local.
+
+NUMA Consumption
+----------------
+
+There are also other cgroup entry help us to estimate NUMA efficiency, which is
+'cpuacct.usage_percpu' and 'memory.numa_stat'.
+
+By reading 'cpuacct.usage_percpu' we will get per-cpu runtime (in nanoseconds)
+info (in hierarchy) as::
+
+  CPU_0_RUNTIME CPU_1_RUNTIME CPU_2_RUNTIME ... CPU_X_RUNTIME
+
+Combined with the info from::
+
+  cat /sys/devices/system/node/nodeX/cpulist
+
+We would be able to accumulate the runtime of CPUs into NUMA nodes, to get the
+per-cgroup node runtime info.
+
+By reading 'memory.numa_stat' we will get per-cgroup node memory consumption
+info as::
+
+  total=TOTAL_MEM N0=MEM_ON_NODE0 N1=MEM_ON_NODE1 ... NX=MEM_ON_NODEX
+
+Together we call these the per-cgroup NUMA consumption info, tell us how many
+resources a particular workload has consumed, on a particular NUMA node.
+
+Monitoring
+----------
+
+By monitoring the increments of locality info, we can easily know whether NUMA
+Balancing is working well for a particular workload.
+
+For example we take a 5 seconds sample period, then on each sampling we have::
+
+  local_diff = last_nr_local_page_access - nr_local_page_access
+  remote_diff = last_nr_remote_page_access - nr_remote_page_access
+
+and we get the locality in this period as::
+
+  locality = local_diff * 100 / (local_diff + remote_diff)
+
+We can plot a line for locality, when the line close to 100% things are good,
+when getting close to 0% something is wrong, we can pick a proper watermark to
+trigger warning message.
+
+You may want to drop the data if the local/remote_diff is too small, which
+implies there are not many available pages for NUMA Balancing to scan, ignoring
+would be fine since most likely the workload is insensitive to NUMA, or the
+memory topology is already good enough.
+
+Monitoring root group helps you control the overall situation, while you may
+also want to monitor all the leaf groups which contain the workloads, this
+helps to catch the mouse.
+
+Try to put your workload into also the cpuacct & memory cgroup, when NUMA
+Balancing is disabled or locality becomes too small, we may want to monitoring
+the per-node runtime & memory info to see if the node consumption meet the
+requirements.
+
+For NUMA node X on each sampling we have::
+
+  runtime_X_diff = runtime_X - last_runtime_X
+  runtime_all_diff = runtime_all - last_runtime_all
+
+  runtime_percent_X = runtime_X_diff * 100 / runtime_all_diff
+  memory_percent_X = memory_X * 100 / memory_all
+
+These two percentages are usually matched on each node, workload should execute
+mostly on the node that contains most of its memory, but it's not guaranteed.
+
+The workload may only access a small part of its memory, in such cases although
+the majority of memory are remotely, locality could still be good.
+
+Thus to tell if things are fine or not depends on the understanding of system
+resource deployment, however, if you find node X got 100% memory percent but 0%
+runtime percent, definitely something is wrong.
+
+Troubleshooting
+---------------
+
+After identifying which workload introduced the bad locality, check:
+
+1). Is the workload bound to a particular NUMA node?
+2). Has any NUMA node run out of resources?
+
+There are several ways to bind task's memory with a NUMA node, the strict way
+like the MPOL_BIND memory policy or 'cpuset.mems' will limit the memory
+node where to allocate pages. In this situation, admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and make sure there are
+available CPU resource there.
+
+There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
+sched_setaffinity() syscall. In this situation, NUMA Balancing help to migrate
+pages into that node, admin should make sure there are available memory there.
+
+Admin could try to rebind or unbind the NUMA node to erase the damage, make a
+change then observe the statistics to see if things get better until the
+situation is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may found no necessary to scan pages, and
+locality could always be 0 or small number, don't pay attention to them
+since they most likely insensitive to NUMA.
+
+There is no accounting until the option is turned on, so enable it in advance
+if you want to have the whole history.
+
+We have per-task migfailed counter to tell how many page migration has been
+failed for a particular task, you will find it in /proc/PID/sched entry.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 4405b7485312..c75a3fdfcd94 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -112,6 +112,7 @@ configure specific aspects of kernel behavior to your liking.
    video-output
    wimax/index
    xfs
+   cg-numa-stat

 .. only::  subproject and html

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0945611b3877..9d9e57d19af3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3227,6 +3227,10 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable

+	numa_locality	[KNL] Enable per-cgroup numa locality info.
+			Useful to debug NUMA efficiency problems when there are
+			lots of per-cgroup workloads.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 7e203b3ed331..efa995e757fd 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -572,6 +572,15 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.

+numa_locality:
+=============
+
+Enables/disables per-cgroup NUMA locality info.
+
+0: disabled (default).
+1: enabled.
+
+Check Documentation/admin-guide/cg-numa-stat.rst for details.

 osrelease, ostype & version:
 ============================
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics
  2019-12-03  6:02     ` [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
@ 2019-12-03 13:43       ` Jonathan Corbet
  2019-12-04  2:27         ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Corbet @ 2019-12-03 13:43 UTC (permalink / raw)
  To: 王贇
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap

On Tue, 3 Dec 2019 14:02:18 +0800
王贇 <yun.wang@linux.alibaba.com> wrote:

> Add the description for 'numa_locality', also a new doc to explain
> the details on how to deal with the per-cgroup numa statistics.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Iurii Zaikin <yzaikin@google.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  Documentation/admin-guide/cg-numa-stat.rst      | 176 ++++++++++++++++++++++++
>  Documentation/admin-guide/index.rst             |   1 +
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  include/linux/sched.h                           |  10 +-
>  init/Kconfig                                    |   4 +-
>  kernel/sched/fair.c                             |   4 +-
>  7 files changed, 200 insertions(+), 8 deletions(-)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst
> 
> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
> new file mode 100644
> index 000000000000..49167db36f37
> --- /dev/null
> +++ b/Documentation/admin-guide/cg-numa-stat.rst
> @@ -0,0 +1,176 @@
> +===============================
> +Per-cgroup NUMA statistics
> +===============================

One small request: can we get an SPDX line at the beginning of that new
file?

Thanks,

jon

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics
  2019-12-03 13:43       ` Jonathan Corbet
@ 2019-12-04  2:27         ` 王贇
  0 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-12-04  2:27 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap



On 2019/12/3 下午9:43, Jonathan Corbet wrote:
> On Tue, 3 Dec 2019 14:02:18 +0800
[snip]
>> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
>> new file mode 100644
>> index 000000000000..49167db36f37
>> --- /dev/null
>> +++ b/Documentation/admin-guide/cg-numa-stat.rst
>> @@ -0,0 +1,176 @@
>> +===============================
>> +Per-cgroup NUMA statistics
>> +===============================
> 
> One small request: can we get an SPDX line at the beginning of that new
> file?

Certainly, will be in next version :-)

Regards,
Michael Wang

> 
> Thanks,
> 
> jon
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-03  6:00     ` [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
@ 2019-12-04  2:33       ` Randy Dunlap
  2019-12-04  2:38         ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2019-12-04  2:33 UTC (permalink / raw)
  To: 王贇,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Jonathan Corbet

On 12/2/19 10:00 PM, 王贇 wrote:
> diff --git a/init/Kconfig b/init/Kconfig
> index 4d8d145c41d2..9c086f716a6d 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>  	  machine.
> 
> +config CGROUP_NUMA_LOCALITY
> +	bool "The per-cgroup NUMA Locality"

I would drop "The".

> +	default n
> +	depends on CGROUP_SCHED && NUMA_BALANCING
> +	help
> +	  This option enable the collection of per-cgroup NUMA locality info,

	              enables

> +	  to tell whether NUMA Balancing is working well for a particular
> +	  workload.
> +
>  menuconfig CGROUPS
>  	bool "Control Group support"
>  	select KERNFS


-- 
~Randy
Reported-by: Randy Dunlap <rdunlap@infradead.org>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-04  2:33       ` Randy Dunlap
@ 2019-12-04  2:38         ` 王贇
  0 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-12-04  2:38 UTC (permalink / raw)
  To: Randy Dunlap, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Jonathan Corbet



On 2019/12/4 上午10:33, Randy Dunlap wrote:
> On 12/2/19 10:00 PM, 王贇 wrote:
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 4d8d145c41d2..9c086f716a6d 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>>  	  machine.
>>
>> +config CGROUP_NUMA_LOCALITY
>> +	bool "The per-cgroup NUMA Locality"
> 
> I would drop "The".
> 
>> +	default n
>> +	depends on CGROUP_SCHED && NUMA_BALANCING
>> +	help
>> +	  This option enable the collection of per-cgroup NUMA locality info,
> 
> 	              enables

Will fix them in next version too~

Regards,
Michael Wang


> 
>> +	  to tell whether NUMA Balancing is working well for a particular
>> +	  workload.
>> +
>>  menuconfig CGROUPS
>>  	bool "Control Group support"
>>  	select KERNFS
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4 0/2] sched/numa: introduce numa locality
  2019-12-03  5:59   ` [PATCH v3 0/2] sched/numa: introduce numa locality 王贇
  2019-12-03  6:00     ` [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
  2019-12-03  6:02     ` [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
@ 2019-12-04  7:58     ` 王贇
  2019-12-04  7:59       ` [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
                         ` (2 more replies)
  2 siblings, 3 replies; 45+ messages in thread
From: 王贇 @ 2019-12-04  7:58 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Since v3:
  * fix comments and improved documentation
Since v2:
  * simplified the locality concept & implementation
Since v1:
  * improved documentation

Modern production environment could use hundreds of cgroup to control
the resources for different workloads, along with the complicated
resource binding.

On NUMA platforms where we have multiple nodes, things become even more
complicated, we hope there are more local memory access to improve the
performance, and NUMA Balancing keep working hard to achieve that,
however, wrong memory policy or node binding could easily waste the
effort, result a lot of remote page accessing.

We need to notice such problems, then we got chance to fix it before
there are too much damages, however, there are no good monitoring
approach yet to help catch the mouse who introduced the remote access.

This patch set is trying to fill in the missing pieces, by introduce
the per-cgroup NUMA locality info, with this new statistics, we could
achieve the daily monitoring on NUMA efficiency, to give warning when
things going too wrong.

Please check the second patch for more details.
Michael Wang (2):
  sched/numa: introduce per-cgroup NUMA locality info
  sched/numa: documentation for per-cgroup numa statistics

 Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 include/linux/sched.h                           |  15 ++
 include/linux/sched/sysctl.h                    |   6 +
 init/Kconfig                                    |  11 ++
 kernel/sched/core.c                             |  75 ++++++++++
 kernel/sched/fair.c                             |  62 +++++++++
 kernel/sched/sched.h                            |  12 ++
 kernel/sysctl.c                                 |  11 ++
 11 files changed, 384 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-04  7:58     ` [PATCH v4 0/2] sched/numa: introduce numa locality 王贇
@ 2019-12-04  7:59       ` 王贇
  2019-12-05  3:28         ` Randy Dunlap
  2019-12-04  8:00       ` [PATCH v4 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
  2019-12-05  6:53       ` [PATCH v5 0/2] sched/numa: introduce numa locality 王贇
  2 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-12-04  7:59 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Currently there are no good approach to monitoring the per-cgroup NUMA
efficiency, this could be a trouble especially when groups are sharing
CPUs, we don't know which one introduced remote-memory accessing.

Although the per-task NUMA accessing info from PMU is good for further
debuging, but not light enough for daily monitoring, especial on a box
with thousands of tasks.

Fortunately, when NUMA Balancing enabled, it will periodly trigger page
fault and try to increase the NUMA locality, by tracing the results we
will be able to estimate the NUMA efficiency.

On each page fault of NUMA Balancing, when task's executing CPU is from
the same node of pages, we call this a local page accessing, otherwise
a remote page accessing.

By updating task's accessing counter into it's cgroup on ticks, we get
the per-cgroup numa locality info.

For example the new entry 'cpu.numa_stat' show:
  page_access local=1231412 remote=53453

Here we know the workloads in hierarchy have totally been traced 1284865
times of page accessing, and 1231412 of them are local page access, which
imply a good NUMA efficiency.

By monitoring the increments, we will be able to locate the per-cgroup
workload which NUMA Balancing can't helpwith (usually caused by wrong
CPU and memory node bindings), then we got chance to fix that in time.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/sched.h        | 15 +++++++++
 include/linux/sched/sysctl.h |  6 ++++
 init/Kconfig                 |  9 ++++++
 kernel/sched/core.c          | 75 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         | 12 +++++++
 kernel/sysctl.c              | 11 +++++++
 7 files changed, 190 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f6607cd40ac..f73b3cf7d32a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1125,6 +1125,21 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * Counter index stand for:
+	 * 0 -- remote page accessing
+	 * 1 -- local page accessing
+	 * 2 -- remote page accessing updated to cgroup
+	 * 3 -- local page accessing updated to cgroup
+	 *
+	 * We record the counter before the end of task_numa_fault(), this
+	 * is based on the fact that after page fault is handled, the task
+	 * will access the page on the CPU where it triggered the PF.
+	 */
+	unsigned long			numa_page_access[4];
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_sig;
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 89f55e914673..c7048119b8b5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -102,4 +102,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
 				 loff_t *ppos);
 #endif

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern int sysctl_numa_locality(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 4d8d145c41d2..fb7182a0d017 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.

+config CGROUP_NUMA_LOCALITY
+	bool "The per-cgroup NUMA Locality"
+	default n
+	depends on CGROUP_SCHED && NUMA_BALANCING
+	help
+	  This option enable the collection of per-cgroup NUMA locality info,
+	  to tell whether NUMA Balancing is working well for a particular
+	  workload, also imply the NUMA efficiency.
+
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aaa1740e6497..6a7850d94c55 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7657,6 +7657,68 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+DEFINE_STATIC_KEY_FALSE(sched_numa_locality);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_numa_locality(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_numa_locality);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0 || !write)
+		return err;
+
+	if (state)
+		static_branch_enable(&sched_numa_locality);
+	else
+		static_branch_disable(&sched_numa_locality);
+
+	return err;
+}
+#endif
+
+static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
+{
+	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
+}
+
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int cpu;
+	u64 local = 0, remote = 0;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	if (!static_branch_likely(&sched_numa_locality))
+		return 0;
+
+	for_each_possible_cpu(cpu) {
+		local += tg_cfs_rq(tg, cpu)->local_page_access;
+		remote += tg_cfs_rq(tg, cpu)->remote_page_access;
+	}
+
+	seq_printf(sf, "page_access local=%llu remote=%llu\n", local, remote);
+
+	return 0;
+}
+
+static __init int numa_locality_setup(char *opt)
+{
+	static_branch_enable(&sched_numa_locality);
+
+	return 0;
+}
+__setup("numa_locality", numa_locality_setup);
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7706,6 +7768,12 @@ static struct cftype cpu_legacy_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7887,6 +7955,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81eba554db8d..d3a141c79155 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1050,6 +1050,62 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  */

 #ifdef CONFIG_NUMA_BALANCING
+
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+/*
+ * We want to record the real local/remote page access statistic
+ * here, so 'pnid' should be pages's real residential node after
+ * migrate_misplaced_page(), and 'cnid' should be the node of CPU
+ * where triggered the PF.
+ */
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	/*
+	 * pnid != cnid --> remote idx 0
+	 * pnid == cnid --> local idx 1
+	 */
+	p->numa_page_access[!!(pnid == cnid)] += pages;
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+	unsigned long ldiff, rdiff;
+
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	rdiff = current->numa_page_access[0] - current->numa_page_access[2];
+	ldiff = current->numa_page_access[1] - current->numa_page_access[3];
+	if (!ldiff && !rdiff)
+		return;
+
+	cfs_rq->local_page_access += ldiff;
+	cfs_rq->remote_page_access += rdiff;
+
+	/*
+	 * Consider updated when reach root cfs_rq, no NUMA Balancing PF
+	 * should happen on current task during the hierarchical updating.
+	 */
+	if (&cfs_rq->rq->cfs == cfs_rq) {
+		current->numa_page_access[2] = current->numa_page_access[0];
+		current->numa_page_access[3] = current->numa_page_access[1];
+	}
+}
+#else
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+}
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
+
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
  * calculated based on the tasks virtual memory size and
@@ -2465,6 +2521,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+
+	update_task_locality(p, mem_node, numa_node_id(), pages);
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2650,6 +2708,9 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->last_sum_exec_runtime	= 0;

 	init_task_work(&p->numa_work, task_numa_work);
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	memset(p->numa_page_access, 0, sizeof(p->numa_page_access));
+#endif

 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -4298,6 +4359,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
+	update_group_locality(cfs_rq);

 #ifdef CONFIG_SCHED_HRTICK
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 05c282775f21..33f5653d9d4c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,14 @@ struct cfs_rq {
 	struct list_head	throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * The local/remote page access info collected from all
+	 * the tasks in hierarchy.
+	 */
+	u64			local_page_access;
+	u64			remote_page_access;
+#endif
 };

 static inline int rt_bandwidth_enabled(void)
@@ -1601,6 +1609,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern struct static_key_false sched_numa_locality;
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50373984a5e2..73cbb70940ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.procname	= "numa_locality",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_numa_locality,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v4 2/2] sched/numa: documentation for per-cgroup numa statistics
  2019-12-04  7:58     ` [PATCH v4 0/2] sched/numa: introduce numa locality 王贇
  2019-12-04  7:59       ` [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
@ 2019-12-04  8:00       ` 王贇
  2019-12-05  3:40         ` Randy Dunlap
  2019-12-05  6:53       ` [PATCH v5 0/2] sched/numa: introduce numa locality 王贇
  2 siblings, 1 reply; 45+ messages in thread
From: 王贇 @ 2019-12-04  8:00 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Add the description for 'numa_locality', also a new doc to explain
the details on how to deal with the per-cgroup numa statistics.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 init/Kconfig                                    |   6 +-
 5 files changed, 196 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
new file mode 100644
index 000000000000..5d1f623451d5
--- /dev/null
+++ b/Documentation/admin-guide/cg-numa-stat.rst
@@ -0,0 +1,178 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Per-cgroup NUMA statistics
+===============================
+
+Background
+----------
+
+On NUMA platforms, remote memory accessing always has a performance penalty.
+Although we have NUMA balancing working hard to maximize the access locality,
+there are still situations it can't help.
+
+This could happen in modern production environment. When a large number of
+cgroups are used to classify and control resources, this creates a complex
+configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
+balancing could end up with the wrong memory policy or exhausted local NUMA
+node, which would lead to low percentage of local page accesses.
+
+We need to detect such cases, figure out which workloads from which cgroup
+have introduced the issues, then we get chance to do adjustment to avoid
+performance degradation.
+
+However, there are no hardware counters for per-task local/remote accessing
+info, we don't know how many remote page accesses have occurred for a
+particular task.
+
+NUMA Locality
+-------------
+
+Fortunately, we have NUMA Balancing which scans task's mapping and triggers
+page fault periodically, giving us the opportunity to record per-task page
+accessing info, when the CPU fall into PF is from the same node of pages, we
+consider task as doing local page accessing, otherwise the remote page
+accessing, we call these two counter the locality info.
+
+On each tick, we acquire the locality info of current task on that CPU, update
+the increments into it's cgroup, becoming the group locality info.
+
+By "echo 1 > /proc/sys/kernel/numa_locality" at runtime or adding boot parameter
+'numa_locality', we will enable the accounting of per-cgroup NUMA locality info,
+the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
+
+  page_access local=NR_LOCAL_PAGE_ACCESS remote=NR_REMOTE_PAGE_ACCESS
+
+We define 'NUMA locality' as::
+
+  NR_LOCAL_PAGE_ACCESS * 100 / (NR_LOCAL_PAGE_ACCESS + NR_REMOTE_PAGE_ACCESS)
+
+This per-cgroup percentage number help to represent the NUMA Balancing behavior.
+
+Note that the accounting is hierarchical, which means the NUMA locality info for
+a given group represent not only the workload of this group, but also the
+workloads of all its descendants.
+
+For example the 'cpu.numa_stat' show::
+
+  page_access local=129909383 remote=18265810
+
+The NUMA locality calculated as::
+
+  129909383 * 100 / (129909383 + 18265810) = 87.67
+
+Thus we know the workload of this group and its descendants have totally done
+129909383 times of local page accessing and 18265810 times of remotes, locality
+is 87.67% which imply most of the memory access are local.
+
+NUMA Consumption
+----------------
+
+There are also other cgroup entry help us to estimate NUMA efficiency, which is
+'cpuacct.usage_percpu' and 'memory.numa_stat'.
+
+By reading 'cpuacct.usage_percpu' we will get per-cpu runtime (in nanoseconds)
+info (in hierarchy) as::
+
+  CPU_0_RUNTIME CPU_1_RUNTIME CPU_2_RUNTIME ... CPU_X_RUNTIME
+
+Combined with the info from::
+
+  cat /sys/devices/system/node/nodeX/cpulist
+
+We would be able to accumulate the runtime of CPUs into NUMA nodes, to get the
+per-cgroup node runtime info.
+
+By reading 'memory.numa_stat' we will get per-cgroup node memory consumption
+info as::
+
+  total=TOTAL_MEM N0=MEM_ON_NODE0 N1=MEM_ON_NODE1 ... NX=MEM_ON_NODEX
+
+Together we call these the per-cgroup NUMA consumption info, tell us how many
+resources a particular workload has consumed, on a particular NUMA node.
+
+Monitoring
+----------
+
+By monitoring the increments of locality info, we can easily know whether NUMA
+Balancing is working well for a particular workload.
+
+For example we take a 5 seconds sample period, then on each sampling we have::
+
+  local_diff = last_nr_local_page_access - nr_local_page_access
+  remote_diff = last_nr_remote_page_access - nr_remote_page_access
+
+and we get the locality in this period as::
+
+  locality = local_diff * 100 / (local_diff + remote_diff)
+
+We can plot a line for locality, when the line close to 100% things are good,
+when getting close to 0% something is wrong, we can pick a proper watermark to
+trigger warning message.
+
+You may want to drop the data if the local/remote_diff is too small, which
+implies there are not many available pages for NUMA Balancing to scan, ignoring
+would be fine since most likely the workload is insensitive to NUMA, or the
+memory topology is already good enough.
+
+Monitoring root group helps you control the overall situation, while you may
+also want to monitor all the leaf groups which contain the workloads, this
+helps to catch the mouse.
+
+Try to put your workload into also the cpuacct & memory cgroup, when NUMA
+Balancing is disabled or locality becomes too small, we may want to monitoring
+the per-node runtime & memory info to see if the node consumption meet the
+requirements.
+
+For NUMA node X on each sampling we have::
+
+  runtime_X_diff = runtime_X - last_runtime_X
+  runtime_all_diff = runtime_all - last_runtime_all
+
+  runtime_percent_X = runtime_X_diff * 100 / runtime_all_diff
+  memory_percent_X = memory_X * 100 / memory_all
+
+These two percentages are usually matched on each node, workload should execute
+mostly on the node that contains most of its memory, but it's not guaranteed.
+
+The workload may only access a small part of its memory, in such cases although
+the majority of memory are remotely, locality could still be good.
+
+Thus to tell if things are fine or not depends on the understanding of system
+resource deployment, however, if you find node X got 100% memory percent but 0%
+runtime percent, definitely something is wrong.
+
+Troubleshooting
+---------------
+
+After identifying which workload introduced the bad locality, check:
+
+1). Is the workload bound to a particular NUMA node?
+2). Has any NUMA node run out of resources?
+
+There are several ways to bind task's memory with a NUMA node, the strict way
+like the MPOL_BIND memory policy or 'cpuset.mems' will limit the memory
+node where to allocate pages. In this situation, admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and make sure there are
+available CPU resource there.
+
+There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
+sched_setaffinity() syscall. In this situation, NUMA Balancing help to migrate
+pages into that node, admin should make sure there are available memory there.
+
+Admin could try to rebind or unbind the NUMA node to erase the damage, make a
+change then observe the statistics to see if things get better until the
+situation is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may found no necessary to scan pages, and
+locality could always be 0 or small number, don't pay attention to them
+since they most likely insensitive to NUMA.
+
+There is no accounting until the option is turned on, so enable it in advance
+if you want to have the whole history.
+
+We have per-task migfailed counter to tell how many page migration has been
+failed for a particular task, you will find it in /proc/PID/sched entry.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 4405b7485312..c75a3fdfcd94 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -112,6 +112,7 @@ configure specific aspects of kernel behavior to your liking.
    video-output
    wimax/index
    xfs
+   cg-numa-stat

 .. only::  subproject and html

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0945611b3877..9d9e57d19af3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3227,6 +3227,10 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable

+	numa_locality	[KNL] Enable per-cgroup numa locality info.
+			Useful to debug NUMA efficiency problems when there are
+			lots of per-cgroup workloads.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 7e203b3ed331..efa995e757fd 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -572,6 +572,15 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.

+numa_locality:
+=============
+
+Enables/disables per-cgroup NUMA locality info.
+
+0: disabled (default).
+1: enabled.
+
+Check Documentation/admin-guide/cg-numa-stat.rst for details.

 osrelease, ostype & version:
 ============================
diff --git a/init/Kconfig b/init/Kconfig
index fb7182a0d017..3538fdd73387 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -818,13 +818,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  machine.

 config CGROUP_NUMA_LOCALITY
-	bool "The per-cgroup NUMA Locality"
+	bool "per-cgroup NUMA Locality"
 	default n
 	depends on CGROUP_SCHED && NUMA_BALANCING
 	help
-	  This option enable the collection of per-cgroup NUMA locality info,
+	  This option enables the collection of per-cgroup NUMA locality info,
 	  to tell whether NUMA Balancing is working well for a particular
 	  workload, also imply the NUMA efficiency.
+	  See
+		-  Documentation/admin-guide/cg-numa-stat.rst

 menuconfig CGROUPS
 	bool "Control Group support"
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-04  7:59       ` [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
@ 2019-12-05  3:28         ` Randy Dunlap
  2019-12-05  3:29           ` Randy Dunlap
  0 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2019-12-05  3:28 UTC (permalink / raw)
  To: 王贇,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Jonathan Corbet

Hi,

It seems that you missed my previous comments...


On 12/3/19 11:59 PM, 王贇 wrote:
> diff --git a/init/Kconfig b/init/Kconfig
> index 4d8d145c41d2..fb7182a0d017 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>  	  machine.
> 
> +config CGROUP_NUMA_LOCALITY
> +	bool "The per-cgroup NUMA Locality"

Drop "The"

> +	default n
> +	depends on CGROUP_SCHED && NUMA_BALANCING
> +	help
> +	  This option enable the collection of per-cgroup NUMA locality info,

	              enables

> +	  to tell whether NUMA Balancing is working well for a particular
> +	  workload, also imply the NUMA efficiency.
> +
>  menuconfig CGROUPS
>  	bool "Control Group support"
>  	select KERNFS


thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-05  3:28         ` Randy Dunlap
@ 2019-12-05  3:29           ` Randy Dunlap
  2019-12-05  3:52             ` 王贇
  0 siblings, 1 reply; 45+ messages in thread
From: Randy Dunlap @ 2019-12-05  3:29 UTC (permalink / raw)
  To: 王贇,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Jonathan Corbet

On 12/4/19 7:28 PM, Randy Dunlap wrote:
> Hi,
> 
> It seems that you missed my previous comments...
> 
> 
> On 12/3/19 11:59 PM, 王贇 wrote:
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 4d8d145c41d2..fb7182a0d017 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>>  	  machine.
>>
>> +config CGROUP_NUMA_LOCALITY
>> +	bool "The per-cgroup NUMA Locality"
> 
> Drop "The"
> 
>> +	default n
>> +	depends on CGROUP_SCHED && NUMA_BALANCING
>> +	help
>> +	  This option enable the collection of per-cgroup NUMA locality info,
> 
> 	              enables
> 
>> +	  to tell whether NUMA Balancing is working well for a particular
>> +	  workload, also imply the NUMA efficiency.
>> +
>>  menuconfig CGROUPS
>>  	bool "Control Group support"
>>  	select KERNFS
> 
> 
> thanks.
> 

Ah, the changes are in patch 2/2/ for some reason.  OK, thanks.

-- 
~Randy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 2/2] sched/numa: documentation for per-cgroup numa statistics
  2019-12-04  8:00       ` [PATCH v4 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
@ 2019-12-05  3:40         ` Randy Dunlap
  0 siblings, 0 replies; 45+ messages in thread
From: Randy Dunlap @ 2019-12-05  3:40 UTC (permalink / raw)
  To: 王贇,
	Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Jonathan Corbet

On 12/4/19 12:00 AM, 王贇 wrote:
> Add the description for 'numa_locality', also a new doc to explain
> the details on how to deal with the per-cgroup numa statistics.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Iurii Zaikin <yzaikin@google.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
>  Documentation/admin-guide/index.rst             |   1 +
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  init/Kconfig                                    |   6 +-
>  5 files changed, 196 insertions(+), 2 deletions(-)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst
> 
> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
> new file mode 100644
> index 000000000000..5d1f623451d5
> --- /dev/null
> +++ b/Documentation/admin-guide/cg-numa-stat.rst
> @@ -0,0 +1,178 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +Per-cgroup NUMA statistics
> +===============================
> +
> +Background
> +----------
> +
> +On NUMA platforms, remote memory accessing always has a performance penalty.
> +Although we have NUMA balancing working hard to maximize the access locality,
> +there are still situations it can't help.
> +
> +This could happen in modern production environment. When a large number of
> +cgroups are used to classify and control resources, this creates a complex
> +configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
> +balancing could end up with the wrong memory policy or exhausted local NUMA
> +node, which would lead to low percentage of local page accesses.
> +
> +We need to detect such cases, figure out which workloads from which cgroup
> +have introduced the issues, then we get chance to do adjustment to avoid
> +performance degradation.
> +
> +However, there are no hardware counters for per-task local/remote accessing
> +info, we don't know how many remote page accesses have occurred for a
> +particular task.
> +
> +NUMA Locality
> +-------------
> +
> +Fortunately, we have NUMA Balancing which scans task's mapping and triggers
> +page fault periodically, giving us the opportunity to record per-task page
> +accessing info, when the CPU fall into PF is from the same node of pages, we
> +consider task as doing local page accessing, otherwise the remote page
> +accessing, we call these two counter the locality info.
> +
> +On each tick, we acquire the locality info of current task on that CPU, update
> +the increments into it's cgroup, becoming the group locality info.

                       its
(it's means it is, which is not correct here)

> +
> +By "echo 1 > /proc/sys/kernel/numa_locality" at runtime or adding boot parameter
> +'numa_locality', we will enable the accounting of per-cgroup NUMA locality info,
> +the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
> +
> +  page_access local=NR_LOCAL_PAGE_ACCESS remote=NR_REMOTE_PAGE_ACCESS
> +
> +We define 'NUMA locality' as::
> +
> +  NR_LOCAL_PAGE_ACCESS * 100 / (NR_LOCAL_PAGE_ACCESS + NR_REMOTE_PAGE_ACCESS)
> +
> +This per-cgroup percentage number help to represent the NUMA Balancing behavior.

                                     helps

> +
> +Note that the accounting is hierarchical, which means the NUMA locality info for
> +a given group represent not only the workload of this group, but also the
> +workloads of all its descendants.
> +
> +For example the 'cpu.numa_stat' show::

                                   shows::

> +
> +  page_access local=129909383 remote=18265810
> +
> +The NUMA locality calculated as::
> +
> +  129909383 * 100 / (129909383 + 18265810) = 87.67
> +
> +Thus we know the workload of this group and its descendants have totally done
> +129909383 times of local page accessing and 18265810 times of remotes, locality
> +is 87.67% which imply most of the memory access are local.
> +
> +NUMA Consumption
> +----------------
> +
> +There are also other cgroup entry help us to estimate NUMA efficiency, which is
> +'cpuacct.usage_percpu' and 'memory.numa_stat'.
> +
> +By reading 'cpuacct.usage_percpu' we will get per-cpu runtime (in nanoseconds)
> +info (in hierarchy) as::
> +
> +  CPU_0_RUNTIME CPU_1_RUNTIME CPU_2_RUNTIME ... CPU_X_RUNTIME
> +
> +Combined with the info from::
> +
> +  cat /sys/devices/system/node/nodeX/cpulist
> +
> +We would be able to accumulate the runtime of CPUs into NUMA nodes, to get the
> +per-cgroup node runtime info.
> +
> +By reading 'memory.numa_stat' we will get per-cgroup node memory consumption
> +info as::
> +
> +  total=TOTAL_MEM N0=MEM_ON_NODE0 N1=MEM_ON_NODE1 ... NX=MEM_ON_NODEX
> +
> +Together we call these the per-cgroup NUMA consumption info, tell us how many
> +resources a particular workload has consumed, on a particular NUMA node.
> +
> +Monitoring
> +----------
> +
> +By monitoring the increments of locality info, we can easily know whether NUMA
> +Balancing is working well for a particular workload.
> +
> +For example we take a 5 seconds sample period, then on each sampling we have::
> +
> +  local_diff = last_nr_local_page_access - nr_local_page_access
> +  remote_diff = last_nr_remote_page_access - nr_remote_page_access
> +
> +and we get the locality in this period as::
> +
> +  locality = local_diff * 100 / (local_diff + remote_diff)
> +
> +We can plot a line for locality, when the line close to 100% things are good,
> +when getting close to 0% something is wrong, we can pick a proper watermark to
> +trigger warning message.
> +
> +You may want to drop the data if the local/remote_diff is too small, which
> +implies there are not many available pages for NUMA Balancing to scan, ignoring
> +would be fine since most likely the workload is insensitive to NUMA, or the
> +memory topology is already good enough.
> +
> +Monitoring root group helps you control the overall situation, while you may
> +also want to monitor all the leaf groups which contain the workloads, this
> +helps to catch the mouse.
> +
> +Try to put your workload into also the cpuacct & memory cgroup, when NUMA
> +Balancing is disabled or locality becomes too small, we may want to monitoring

                                                                    to monitor

> +the per-node runtime & memory info to see if the node consumption meet the
> +requirements.
> +
> +For NUMA node X on each sampling we have::
> +
> +  runtime_X_diff = runtime_X - last_runtime_X
> +  runtime_all_diff = runtime_all - last_runtime_all
> +
> +  runtime_percent_X = runtime_X_diff * 100 / runtime_all_diff
> +  memory_percent_X = memory_X * 100 / memory_all
> +
> +These two percentages are usually matched on each node, workload should execute
> +mostly on the node that contains most of its memory, but it's not guaranteed.
> +
> +The workload may only access a small part of its memory, in such cases although
> +the majority of memory are remotely, locality could still be good.
> +
> +Thus to tell if things are fine or not depends on the understanding of system
> +resource deployment, however, if you find node X got 100% memory percent but 0%
> +runtime percent, definitely something is wrong.
> +
> +Troubleshooting
> +---------------
> +
> +After identifying which workload introduced the bad locality, check:
> +
> +1). Is the workload bound to a particular NUMA node?
> +2). Has any NUMA node run out of resources?
> +
> +There are several ways to bind task's memory with a NUMA node, the strict way
> +like the MPOL_BIND memory policy or 'cpuset.mems' will limit the memory
> +node where to allocate pages. In this situation, admin should make sure the
> +task is allowed to run on the CPUs of that NUMA node, and make sure there are
> +available CPU resource there.
> +
> +There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
> +sched_setaffinity() syscall. In this situation, NUMA Balancing help to migrate
> +pages into that node, admin should make sure there are available memory there.
> +
> +Admin could try to rebind or unbind the NUMA node to erase the damage, make a
> +change then observe the statistics to see if things get better until the
> +situation is acceptable.
> +
> +Highlights
> +----------
> +
> +For some tasks, NUMA Balancing may found no necessary to scan pages, and

                                  may be found to be unnecessary

> +locality could always be 0 or small number, don't pay attention to them
> +since they most likely insensitive to NUMA.
> +
> +There is no accounting until the option is turned on, so enable it in advance
> +if you want to have the whole history.
> +
> +We have per-task migfailed counter to tell how many page migration has been
> +failed for a particular task, you will find it in /proc/PID/sched entry.


HTH.
-- 
~Randy


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-05  3:29           ` Randy Dunlap
@ 2019-12-05  3:52             ` 王贇
  0 siblings, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-12-05  3:52 UTC (permalink / raw)
  To: Randy Dunlap, Ingo Molnar, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Luis Chamberlain, Kees Cook, Iurii Zaikin,
	Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Jonathan Corbet



On 2019/12/5 上午11:29, Randy Dunlap wrote:
> On 12/4/19 7:28 PM, Randy Dunlap wrote:
>> Hi,
>>
>> It seems that you missed my previous comments...
>>
>>
>> On 12/3/19 11:59 PM, 王贇 wrote:
>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index 4d8d145c41d2..fb7182a0d017 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>>>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>>>  	  machine.
>>>
>>> +config CGROUP_NUMA_LOCALITY
>>> +	bool "The per-cgroup NUMA Locality"
>>
>> Drop "The"
>>
>>> +	default n
>>> +	depends on CGROUP_SCHED && NUMA_BALANCING
>>> +	help
>>> +	  This option enable the collection of per-cgroup NUMA locality info,
>>
>> 	              enables
>>
>>> +	  to tell whether NUMA Balancing is working well for a particular
>>> +	  workload, also imply the NUMA efficiency.
>>> +
>>>  menuconfig CGROUPS
>>>  	bool "Control Group support"
>>>  	select KERNFS
>>
>>
>> thanks.
>>
> 
> Ah, the changes are in patch 2/2/ for some reason.  OK, thanks.

My bad, let's move them to 1/2 then, also will apply the comments you sent
for 2/2 in next version ;-)

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 0/2] sched/numa: introduce numa locality
  2019-12-04  7:58     ` [PATCH v4 0/2] sched/numa: introduce numa locality 王贇
  2019-12-04  7:59       ` [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
  2019-12-04  8:00       ` [PATCH v4 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
@ 2019-12-05  6:53       ` 王贇
  2019-12-05  6:53         ` [PATCH v5 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
  2019-12-05  6:54         ` [PATCH v5 2/2] sched/numa: documentation for per-cgroup numa, statistics 王贇
  2 siblings, 2 replies; 45+ messages in thread
From: 王贇 @ 2019-12-05  6:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Since v4:
  * improved documentation
Since v3:
  * fix comments and improved documentation
Since v2:
  * simplified the locality concept & implementation
Since v1:
  * improved documentation

Modern production environment could use hundreds of cgroup to control
the resources for different workloads, along with the complicated
resource binding.

On NUMA platforms where we have multiple nodes, things become even more
complicated, we hope there are more local memory access to improve the
performance, and NUMA Balancing keep working hard to achieve that,
however, wrong memory policy or node binding could easily waste the
effort, result a lot of remote page accessing.

We need to notice such problems, then we got chance to fix it before
there are too much damages, however, there are no good monitoring
approach yet to help catch the mouse who introduced the remote access.

This patch set is trying to fill in the missing pieces, by introduce
the per-cgroup NUMA locality info, with this new statistics, we could
achieve the daily monitoring on NUMA efficiency, to give warning when
things going too wrong.

Please check the second patch for more details.

Michael Wang (2):
  sched/numa: introduce per-cgroup NUMA locality info
  sched/numa: documentation for per-cgroup numa statistics

 Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 include/linux/sched.h                           |  15 ++
 include/linux/sched/sysctl.h                    |   6 +
 init/Kconfig                                    |  11 ++
 kernel/sched/core.c                             |  75 ++++++++++
 kernel/sched/fair.c                             |  62 +++++++++
 kernel/sched/sched.h                            |  12 ++
 kernel/sysctl.c                                 |  11 ++
 11 files changed, 384 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2019-12-05  6:53       ` [PATCH v5 0/2] sched/numa: introduce numa locality 王贇
@ 2019-12-05  6:53         ` 王贇
  2019-12-05  6:54         ` [PATCH v5 2/2] sched/numa: documentation for per-cgroup numa, statistics 王贇
  1 sibling, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-12-05  6:53 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Currently there are no good approach to monitoring the per-cgroup NUMA
efficiency, this could be a trouble especially when groups are sharing
CPUs, we don't know which one introduced remote-memory accessing.

Although the per-task NUMA accessing info from PMU is good for further
debuging, but not light enough for daily monitoring, especial on a box
with thousands of tasks.

Fortunately, when NUMA Balancing enabled, it will periodly trigger page
fault and try to increase the NUMA locality, by tracing the results we
will be able to estimate the NUMA efficiency.

On each page fault of NUMA Balancing, when task's executing CPU is from
the same node of pages, we call this a local page accessing, otherwise
a remote page accessing.

By updating task's accessing counter into it's cgroup on ticks, we get
the per-cgroup numa locality info.

For example the new entry 'cpu.numa_stat' show:
  page_access local=1231412 remote=53453

Here we know the workloads in hierarchy have totally been traced 1284865
times of page accessing, and 1231412 of them are local page access, which
imply a good NUMA efficiency.

By monitoring the increments, we will be able to locate the per-cgroup
workload which NUMA Balancing can't helpwith (usually caused by wrong
CPU and memory node bindings), then we got chance to fix that in time.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/sched.h        | 15 +++++++++
 include/linux/sched/sysctl.h |  6 ++++
 init/Kconfig                 |  9 ++++++
 kernel/sched/core.c          | 75 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         | 12 +++++++
 kernel/sysctl.c              | 11 +++++++
 7 files changed, 190 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f6607cd40ac..f73b3cf7d32a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1125,6 +1125,21 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * Counter index stand for:
+	 * 0 -- remote page accessing
+	 * 1 -- local page accessing
+	 * 2 -- remote page accessing updated to cgroup
+	 * 3 -- local page accessing updated to cgroup
+	 *
+	 * We record the counter before the end of task_numa_fault(), this
+	 * is based on the fact that after page fault is handled, the task
+	 * will access the page on the CPU where it triggered the PF.
+	 */
+	unsigned long			numa_page_access[4];
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_sig;
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 89f55e914673..c7048119b8b5 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -102,4 +102,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
 				 loff_t *ppos);
 #endif

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern int sysctl_numa_locality(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 4d8d145c41d2..c614ba6bdcc2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -817,6 +817,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.

+config CGROUP_NUMA_LOCALITY
+	bool "per-cgroup NUMA Locality"
+	default n
+	depends on CGROUP_SCHED && NUMA_BALANCING
+	help
+	  This option enables the collection of per-cgroup NUMA locality info,
+	  to tell whether NUMA Balancing is working well for a particular
+	  workload, also imply the NUMA efficiency.
+
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aaa1740e6497..6a7850d94c55 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7657,6 +7657,68 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+DEFINE_STATIC_KEY_FALSE(sched_numa_locality);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_numa_locality(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_numa_locality);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0 || !write)
+		return err;
+
+	if (state)
+		static_branch_enable(&sched_numa_locality);
+	else
+		static_branch_disable(&sched_numa_locality);
+
+	return err;
+}
+#endif
+
+static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
+{
+	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
+}
+
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int cpu;
+	u64 local = 0, remote = 0;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	if (!static_branch_likely(&sched_numa_locality))
+		return 0;
+
+	for_each_possible_cpu(cpu) {
+		local += tg_cfs_rq(tg, cpu)->local_page_access;
+		remote += tg_cfs_rq(tg, cpu)->remote_page_access;
+	}
+
+	seq_printf(sf, "page_access local=%llu remote=%llu\n", local, remote);
+
+	return 0;
+}
+
+static __init int numa_locality_setup(char *opt)
+{
+	static_branch_enable(&sched_numa_locality);
+
+	return 0;
+}
+__setup("numa_locality", numa_locality_setup);
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7706,6 +7768,12 @@ static struct cftype cpu_legacy_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7887,6 +7955,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 81eba554db8d..d3a141c79155 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1050,6 +1050,62 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  */

 #ifdef CONFIG_NUMA_BALANCING
+
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+/*
+ * We want to record the real local/remote page access statistic
+ * here, so 'pnid' should be pages's real residential node after
+ * migrate_misplaced_page(), and 'cnid' should be the node of CPU
+ * where triggered the PF.
+ */
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	/*
+	 * pnid != cnid --> remote idx 0
+	 * pnid == cnid --> local idx 1
+	 */
+	p->numa_page_access[!!(pnid == cnid)] += pages;
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+	unsigned long ldiff, rdiff;
+
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	rdiff = current->numa_page_access[0] - current->numa_page_access[2];
+	ldiff = current->numa_page_access[1] - current->numa_page_access[3];
+	if (!ldiff && !rdiff)
+		return;
+
+	cfs_rq->local_page_access += ldiff;
+	cfs_rq->remote_page_access += rdiff;
+
+	/*
+	 * Consider updated when reach root cfs_rq, no NUMA Balancing PF
+	 * should happen on current task during the hierarchical updating.
+	 */
+	if (&cfs_rq->rq->cfs == cfs_rq) {
+		current->numa_page_access[2] = current->numa_page_access[0];
+		current->numa_page_access[3] = current->numa_page_access[1];
+	}
+}
+#else
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+}
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
+
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
  * calculated based on the tasks virtual memory size and
@@ -2465,6 +2521,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+
+	update_task_locality(p, mem_node, numa_node_id(), pages);
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2650,6 +2708,9 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->last_sum_exec_runtime	= 0;

 	init_task_work(&p->numa_work, task_numa_work);
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	memset(p->numa_page_access, 0, sizeof(p->numa_page_access));
+#endif

 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -4298,6 +4359,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
+	update_group_locality(cfs_rq);

 #ifdef CONFIG_SCHED_HRTICK
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 05c282775f21..33f5653d9d4c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,14 @@ struct cfs_rq {
 	struct list_head	throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * The local/remote page access info collected from all
+	 * the tasks in hierarchy.
+	 */
+	u64			local_page_access;
+	u64			remote_page_access;
+#endif
 };

 static inline int rt_bandwidth_enabled(void)
@@ -1601,6 +1609,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern struct static_key_false sched_numa_locality;
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50373984a5e2..73cbb70940ac 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.procname	= "numa_locality",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_numa_locality,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v5 2/2] sched/numa: documentation for per-cgroup numa, statistics
  2019-12-05  6:53       ` [PATCH v5 0/2] sched/numa: introduce numa locality 王贇
  2019-12-05  6:53         ` [PATCH v5 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
@ 2019-12-05  6:54         ` 王贇
  1 sibling, 0 replies; 45+ messages in thread
From: 王贇 @ 2019-12-05  6:54 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Add the description for 'numa_locality', also a new doc to explain
the details on how to deal with the per-cgroup numa statistics.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 init/Kconfig                                    |   2 +
 5 files changed, 194 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
new file mode 100644
index 000000000000..30ebe5d6404f
--- /dev/null
+++ b/Documentation/admin-guide/cg-numa-stat.rst
@@ -0,0 +1,178 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Per-cgroup NUMA statistics
+===============================
+
+Background
+----------
+
+On NUMA platforms, remote memory accessing always has a performance penalty.
+Although we have NUMA balancing working hard to maximize the access locality,
+there are still situations it can't help.
+
+This could happen in modern production environment. When a large number of
+cgroups are used to classify and control resources, this creates a complex
+configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
+balancing could end up with the wrong memory policy or exhausted local NUMA
+node, which would lead to low percentage of local page accesses.
+
+We need to detect such cases, figure out which workloads from which cgroup
+have introduced the issues, then we get chance to do adjustment to avoid
+performance degradation.
+
+However, there are no hardware counters for per-task local/remote accessing
+info, we don't know how many remote page accesses have occurred for a
+particular task.
+
+NUMA Locality
+-------------
+
+Fortunately, we have NUMA Balancing which scans task's mapping and triggers
+page fault periodically, giving us the opportunity to record per-task page
+accessing info, when the CPU fall into PF is from the same node of pages, we
+consider task as doing local page accessing, otherwise the remote page
+accessing, we call these two counter the locality info.
+
+On each tick, we acquire the locality info of current task on that CPU, update
+the increments into its cgroup, becoming the group locality info.
+
+By "echo 1 > /proc/sys/kernel/numa_locality" at runtime or adding boot parameter
+'numa_locality', we will enable the accounting of per-cgroup NUMA locality info,
+the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
+
+  page_access local=NR_LOCAL_PAGE_ACCESS remote=NR_REMOTE_PAGE_ACCESS
+
+We define 'NUMA locality' as::
+
+  NR_LOCAL_PAGE_ACCESS * 100 / (NR_LOCAL_PAGE_ACCESS + NR_REMOTE_PAGE_ACCESS)
+
+This per-cgroup percentage number helps to represent the NUMA Balancing behavior.
+
+Note that the accounting is hierarchical, which means the NUMA locality info for
+a given group represent not only the workload of this group, but also the
+workloads of all its descendants.
+
+For example the 'cpu.numa_stat' shows::
+
+  page_access local=129909383 remote=18265810
+
+The NUMA locality calculated as::
+
+  129909383 * 100 / (129909383 + 18265810) = 87.67
+
+Thus we know the workload of this group and its descendants have totally done
+129909383 times of local page accessing and 18265810 times of remotes, locality
+is 87.67% which imply most of the memory access are local.
+
+NUMA Consumption
+----------------
+
+There are also other cgroup entry help us to estimate NUMA efficiency, which is
+'cpuacct.usage_percpu' and 'memory.numa_stat'.
+
+By reading 'cpuacct.usage_percpu' we will get per-cpu runtime (in nanoseconds)
+info (in hierarchy) as::
+
+  CPU_0_RUNTIME CPU_1_RUNTIME CPU_2_RUNTIME ... CPU_X_RUNTIME
+
+Combined with the info from::
+
+  cat /sys/devices/system/node/nodeX/cpulist
+
+We would be able to accumulate the runtime of CPUs into NUMA nodes, to get the
+per-cgroup node runtime info.
+
+By reading 'memory.numa_stat' we will get per-cgroup node memory consumption
+info as::
+
+  total=TOTAL_MEM N0=MEM_ON_NODE0 N1=MEM_ON_NODE1 ... NX=MEM_ON_NODEX
+
+Together we call these the per-cgroup NUMA consumption info, tell us how many
+resources a particular workload has consumed, on a particular NUMA node.
+
+Monitoring
+----------
+
+By monitoring the increments of locality info, we can easily know whether NUMA
+Balancing is working well for a particular workload.
+
+For example we take a 5 seconds sample period, then on each sampling we have::
+
+  local_diff = last_nr_local_page_access - nr_local_page_access
+  remote_diff = last_nr_remote_page_access - nr_remote_page_access
+
+and we get the locality in this period as::
+
+  locality = local_diff * 100 / (local_diff + remote_diff)
+
+We can plot a line for locality, when the line close to 100% things are good,
+when getting close to 0% something is wrong, we can pick a proper watermark to
+trigger warning message.
+
+You may want to drop the data if the local/remote_diff is too small, which
+implies there are not many available pages for NUMA Balancing to scan, ignoring
+would be fine since most likely the workload is insensitive to NUMA, or the
+memory topology is already good enough.
+
+Monitoring root group helps you control the overall situation, while you may
+also want to monitor all the leaf groups which contain the workloads, this
+helps to catch the mouse.
+
+Try to put your workload into also the cpuacct & memory cgroup, when NUMA
+Balancing is disabled or locality becomes too small, we may want to monitor
+the per-node runtime & memory info to see if the node consumption meet the
+requirements.
+
+For NUMA node X on each sampling we have::
+
+  runtime_X_diff = runtime_X - last_runtime_X
+  runtime_all_diff = runtime_all - last_runtime_all
+
+  runtime_percent_X = runtime_X_diff * 100 / runtime_all_diff
+  memory_percent_X = memory_X * 100 / memory_all
+
+These two percentages are usually matched on each node, workload should execute
+mostly on the node that contains most of its memory, but it's not guaranteed.
+
+The workload may only access a small part of its memory, in such cases although
+the majority of memory are remotely, locality could still be good.
+
+Thus to tell if things are fine or not depends on the understanding of system
+resource deployment, however, if you find node X got 100% memory percent but 0%
+runtime percent, definitely something is wrong.
+
+Troubleshooting
+---------------
+
+After identifying which workload introduced the bad locality, check:
+
+1). Is the workload bound to a particular NUMA node?
+2). Has any NUMA node run out of resources?
+
+There are several ways to bind task's memory with a NUMA node, the strict way
+like the MPOL_BIND memory policy or 'cpuset.mems' will limit the memory
+node where to allocate pages. In this situation, admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and make sure there are
+available CPU resource there.
+
+There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
+sched_setaffinity() syscall. In this situation, NUMA Balancing help to migrate
+pages into that node, admin should make sure there are available memory there.
+
+Admin could try to rebind or unbind the NUMA node to erase the damage, make a
+change then observe the statistics to see if things get better until the
+situation is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may be found to be unnecessary to scan pages,
+and locality could always be 0 or small number, don't pay attention to them
+since they most likely insensitive to NUMA.
+
+There is no accounting until the option is turned on, so enable it in advance
+if you want to have the whole history.
+
+We have per-task migfailed counter to tell how many page migration has been
+failed for a particular task, you will find it in /proc/PID/sched entry.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 4405b7485312..c75a3fdfcd94 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -112,6 +112,7 @@ configure specific aspects of kernel behavior to your liking.
    video-output
    wimax/index
    xfs
+   cg-numa-stat

 .. only::  subproject and html

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0945611b3877..9d9e57d19af3 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3227,6 +3227,10 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable

+	numa_locality	[KNL] Enable per-cgroup numa locality info.
+			Useful to debug NUMA efficiency problems when there are
+			lots of per-cgroup workloads.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 7e203b3ed331..efa995e757fd 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -572,6 +572,15 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.

+numa_locality:
+=============
+
+Enables/disables per-cgroup NUMA locality info.
+
+0: disabled (default).
+1: enabled.
+
+Check Documentation/admin-guide/cg-numa-stat.rst for details.

 osrelease, ostype & version:
 ============================
diff --git a/init/Kconfig b/init/Kconfig
index c614ba6bdcc2..3538fdd73387 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -825,6 +825,8 @@ config CGROUP_NUMA_LOCALITY
 	  This option enables the collection of per-cgroup NUMA locality info,
 	  to tell whether NUMA Balancing is working well for a particular
 	  workload, also imply the NUMA efficiency.
+	  See
+		-  Documentation/admin-guide/cg-numa-stat.rst

 menuconfig CGROUPS
 	bool "Control Group support"
-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, back to index

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13  3:43 [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
2019-11-13  3:44 ` [PATCH 1/3] sched/numa: advanced per-cgroup " 王贇
2019-11-13  3:45 ` [PATCH 2/3] sched/numa: expose per-task pages-migration-failure 王贇
2019-11-13  3:45 ` [PATCH 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
2019-11-13 15:09   ` Jonathan Corbet
2019-11-14  1:52     ` 王贇
2019-11-13 18:28   ` Iurii Zaikin
2019-11-14  2:22     ` 王贇
2019-11-15  2:29   ` [PATCH v2 " 王贇
2019-11-20  9:45 ` [PATCH 0/3] sched/numa: introduce advanced numa statistic 王贇
2019-11-25  1:35 ` 王贇
2019-11-27  1:48 ` [PATCH v2 " 王贇
2019-11-27  1:49   ` [PATCH v2 1/3] sched/numa: advanced per-cgroup " 王贇
2019-11-27 10:19     ` Mel Gorman
2019-11-28  2:09       ` 王贇
2019-11-28 12:39         ` Michal Koutný
2019-11-28 13:41           ` 王贇
2019-11-28 15:58             ` Michal Koutný
2019-11-29  1:52               ` 王贇
2019-11-29  5:19                 ` 王贇
2019-11-29 10:06                   ` Michal Koutný
2019-12-02  2:11                     ` 王贇
2019-11-27  1:50   ` [PATCH v2 2/3] sched/numa: expose per-task pages-migration-failure 王贇
2019-11-27 10:00     ` Mel Gorman
2019-12-02  2:22     ` 王贇
2019-11-27  1:50   ` [PATCH v2 3/3] sched/numa: documentation for per-cgroup numa stat 王贇
2019-11-27  4:58     ` Randy Dunlap
2019-11-27  5:54       ` 王贇
2019-12-03  5:59   ` [PATCH v3 0/2] sched/numa: introduce numa locality 王贇
2019-12-03  6:00     ` [PATCH v3 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2019-12-04  2:33       ` Randy Dunlap
2019-12-04  2:38         ` 王贇
2019-12-03  6:02     ` [PATCH v3 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
2019-12-03 13:43       ` Jonathan Corbet
2019-12-04  2:27         ` 王贇
2019-12-04  7:58     ` [PATCH v4 0/2] sched/numa: introduce numa locality 王贇
2019-12-04  7:59       ` [PATCH v4 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2019-12-05  3:28         ` Randy Dunlap
2019-12-05  3:29           ` Randy Dunlap
2019-12-05  3:52             ` 王贇
2019-12-04  8:00       ` [PATCH v4 2/2] sched/numa: documentation for per-cgroup numa statistics 王贇
2019-12-05  3:40         ` Randy Dunlap
2019-12-05  6:53       ` [PATCH v5 0/2] sched/numa: introduce numa locality 王贇
2019-12-05  6:53         ` [PATCH v5 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2019-12-05  6:54         ` [PATCH v5 2/2] sched/numa: documentation for per-cgroup numa, statistics 王贇

Linux-Fsdevel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
		linux-fsdevel@vger.kernel.org
	public-inbox-index linux-fsdevel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git