linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RESEND v8 0/2] sched/numa: introduce numa locality
@ 2020-02-07  3:34 王贇
  2020-02-07  3:35 ` [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
  2020-02-07  3:35 ` [PATCH RESEND v8 2/2] sched/numa: documentation for per-cgroup numa 王贇
  0 siblings, 2 replies; 16+ messages in thread
From: 王贇 @ 2020-02-07  3:34 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

v8:
  * document edited
v7:
  * rebased on latest linux-next
v6:
  * fix compile failure when NUMA disabled
v5:
  * improved documentation
v4:
  * fix comments and improved documentation
v3:
  * simplified the locality concept & implementation
v2:
  * improved documentation

Modern production environment could use hundreds of cgroup to control
the resources for different workloads, along with the complicated
resource binding.

On NUMA platforms where we have multiple nodes, things become even more
complicated, we hope there are more local memory access to improve the
performance, and NUMA Balancing keep working hard to achieve that,
however, wrong memory policy or node binding could easily waste the
effort, result a lot of remote page accessing.

We need to notice such problems, then we got chance to fix it before
there are too much damages, however, there are no good monitoring
approach yet to help catch the mouse who introduced the remote access.

This patch set is trying to fill in the missing pieces, by introduce
the per-cgroup NUMA locality info, with this new statistics, we could
achieve the daily monitoring on NUMA efficiency, to give warning when
things going too wrong.

Please check the second patch for more details.

Michael Wang (2):
  sched/numa: introduce per-cgroup NUMA locality info
  sched/numa: documentation for per-cgroup numa statistics

 Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 include/linux/sched.h                           |  15 ++
 include/linux/sched/sysctl.h                    |   6 +
 init/Kconfig                                    |  11 ++
 kernel/sched/core.c                             |  75 ++++++++++
 kernel/sched/fair.c                             |  62 +++++++++
 kernel/sched/sched.h                            |  12 ++
 kernel/sysctl.c                                 |  11 ++
 11 files changed, 384 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-07  3:34 [PATCH RESEND v8 0/2] sched/numa: introduce numa locality 王贇
@ 2020-02-07  3:35 ` 王贇
  2020-02-07  3:37   ` 王贇
  2020-02-14 15:10   ` Peter Zijlstra
  2020-02-07  3:35 ` [PATCH RESEND v8 2/2] sched/numa: documentation for per-cgroup numa 王贇
  1 sibling, 2 replies; 16+ messages in thread
From: 王贇 @ 2020-02-07  3:35 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Currently there are no good approach to monitoring the per-cgroup NUMA
efficiency, this could be a trouble especially when groups are sharing
CPUs, we don't know which one introduced remote-memory accessing.

Although the per-task NUMA accessing info from PMU is good for further
debuging, but not light enough for daily monitoring, especial on a box
with thousands of tasks.

Fortunately, when NUMA Balancing enabled, it will periodly trigger page
fault and try to increase the NUMA locality, by tracing the results we
will be able to estimate the NUMA efficiency.

On each page fault of NUMA Balancing, when task's executing CPU is from
the same node of pages, we call this a local page accessing, otherwise
a remote page accessing.

By updating task's accessing counter into it's cgroup on ticks, we get
the per-cgroup numa locality info.

For example the new entry 'cpu.numa_stat' show:
  page_access local=1231412 remote=53453

Here we know the workloads in hierarchy have totally been traced 1284865
times of page accessing, and 1231412 of them are local page access, which
imply a good NUMA efficiency.

By monitoring the increments, we will be able to locate the per-cgroup
workload which NUMA Balancing can't helpwith (usually caused by wrong
CPU and memory node bindings), then we got chance to fix that in time.

Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 include/linux/sched.h        | 15 +++++++++
 include/linux/sched/sysctl.h |  6 ++++
 init/Kconfig                 |  9 ++++++
 kernel/sched/core.c          | 75 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         | 12 +++++++
 kernel/sysctl.c              | 11 +++++++
 7 files changed, 190 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a6c924fa1c77..74bf234bae53 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1128,6 +1128,21 @@ struct task_struct {
 	unsigned long			numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * Counter index stand for:
+	 * 0 -- remote page accessing
+	 * 1 -- local page accessing
+	 * 2 -- remote page accessing updated to cgroup
+	 * 3 -- local page accessing updated to cgroup
+	 *
+	 * We record the counter before the end of task_numa_fault(), this
+	 * is based on the fact that after page fault is handled, the task
+	 * will access the page on the CPU where it triggered the PF.
+	 */
+	unsigned long			numa_page_access[4];
+#endif
+
 #ifdef CONFIG_RSEQ
 	struct rseq __user *rseq;
 	u32 rseq_sig;
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..bb3721cf48e0 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -101,4 +101,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
 				 loff_t *ppos);
 #endif

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern int sysctl_numa_locality(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+#endif
+
 #endif /* _LINUX_SCHED_SYSCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 322fd2c65304..63c6b90a515d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -813,6 +813,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 	  If set, automatic NUMA balancing will be enabled if running on a NUMA
 	  machine.

+config CGROUP_NUMA_LOCALITY
+	bool "per-cgroup NUMA Locality"
+	default n
+	depends on CGROUP_SCHED && NUMA_BALANCING
+	help
+	  This option enables the collection of per-cgroup NUMA locality info,
+	  to tell whether NUMA Balancing is working well for a particular
+	  workload, also imply the NUMA efficiency.
+
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7b08d52db93..40dd6b221eef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7657,6 +7657,68 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+DEFINE_STATIC_KEY_FALSE(sched_numa_locality);
+
+#ifdef CONFIG_PROC_SYSCTL
+int sysctl_numa_locality(struct ctl_table *table, int write,
+			 void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct ctl_table t;
+	int err;
+	int state = static_branch_likely(&sched_numa_locality);
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	t = *table;
+	t.data = &state;
+	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
+	if (err < 0 || !write)
+		return err;
+
+	if (state)
+		static_branch_enable(&sched_numa_locality);
+	else
+		static_branch_disable(&sched_numa_locality);
+
+	return err;
+}
+#endif
+
+static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
+{
+	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
+}
+
+static int cpu_numa_stat_show(struct seq_file *sf, void *v)
+{
+	int cpu;
+	u64 local = 0, remote = 0;
+	struct task_group *tg = css_tg(seq_css(sf));
+
+	if (!static_branch_likely(&sched_numa_locality))
+		return 0;
+
+	for_each_possible_cpu(cpu) {
+		local += tg_cfs_rq(tg, cpu)->local_page_access;
+		remote += tg_cfs_rq(tg, cpu)->remote_page_access;
+	}
+
+	seq_printf(sf, "page_access local=%llu remote=%llu\n", local, remote);
+
+	return 0;
+}
+
+static __init int numa_locality_setup(char *opt)
+{
+	static_branch_enable(&sched_numa_locality);
+
+	return 0;
+}
+__setup("numa_locality", numa_locality_setup);
+#endif
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	{
@@ -7706,6 +7768,12 @@ static struct cftype cpu_legacy_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* Terminate */
 };
@@ -7887,6 +7955,13 @@ static struct cftype cpu_files[] = {
 		.seq_show = cpu_uclamp_max_show,
 		.write = cpu_uclamp_max_write,
 	},
+#endif
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.name = "numa_stat",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cpu_numa_stat_show,
+	},
 #endif
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d170b5da0e3..eb838557bae2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1049,7 +1049,63 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
  * Scheduling class queueing methods:
  */

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+/*
+ * We want to record the real local/remote page access statistic
+ * here, so 'pnid' should be pages's real residential node after
+ * migrate_misplaced_page(), and 'cnid' should be the node of CPU
+ * where triggered the PF.
+ */
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	/*
+	 * pnid != cnid --> remote idx 0
+	 * pnid == cnid --> local idx 1
+	 */
+	p->numa_page_access[!!(pnid == cnid)] += pages;
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+	unsigned long ldiff, rdiff;
+
+	if (!static_branch_unlikely(&sched_numa_locality))
+		return;
+
+	rdiff = current->numa_page_access[0] - current->numa_page_access[2];
+	ldiff = current->numa_page_access[1] - current->numa_page_access[3];
+	if (!ldiff && !rdiff)
+		return;
+
+	cfs_rq->local_page_access += ldiff;
+	cfs_rq->remote_page_access += rdiff;
+
+	/*
+	 * Consider updated when reach root cfs_rq, no NUMA Balancing PF
+	 * should happen on current task during the hierarchical updating.
+	 */
+	if (&cfs_rq->rq->cfs == cfs_rq) {
+		current->numa_page_access[2] = current->numa_page_access[0];
+		current->numa_page_access[3] = current->numa_page_access[1];
+	}
+}
+#else
+static inline void
+update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
+{
+}
+
+static inline void update_group_locality(struct cfs_rq *cfs_rq)
+{
+}
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
+
 #ifdef CONFIG_NUMA_BALANCING
+
 /*
  * Approximate time to scan a full NUMA task in ms. The task scan period is
  * calculated based on the tasks virtual memory size and
@@ -2465,6 +2521,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
 	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
 	p->numa_faults_locality[local] += pages;
+
+	update_task_locality(p, mem_node, numa_node_id(), pages);
 }

 static void reset_ptenuma_scan(struct task_struct *p)
@@ -2650,6 +2708,9 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
 	p->last_sum_exec_runtime	= 0;

 	init_task_work(&p->numa_work, task_numa_work);
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	memset(p->numa_page_access, 0, sizeof(p->numa_page_access));
+#endif

 	/* New address space, reset the preferred nid */
 	if (!(clone_flags & CLONE_VM)) {
@@ -4313,6 +4374,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 */
 	update_load_avg(cfs_rq, curr, UPDATE_TG);
 	update_cfs_group(curr);
+	update_group_locality(cfs_rq);

 #ifdef CONFIG_SCHED_HRTICK
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8ad11b..66b4e581b6ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,14 @@ struct cfs_rq {
 	struct list_head	throttled_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	/*
+	 * The local/remote page access info collected from all
+	 * the tasks in hierarchy.
+	 */
+	u64			local_page_access;
+	u64			remote_page_access;
+#endif
 };

 static inline int rt_bandwidth_enabled(void)
@@ -1601,6 +1609,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
 extern struct static_key_false sched_numa_balancing;
 extern struct static_key_false sched_schedstats;

+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+extern struct static_key_false sched_numa_locality;
+#endif
+
 static inline u64 global_rt_period(void)
 {
 	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d396aaaf19a3..a8f5951f92b3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_CGROUP_NUMA_LOCALITY
+	{
+		.procname	= "numa_locality",
+		.data		= NULL, /* filled in by handler */
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_numa_locality,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
+#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
 		.procname	= "sched_rt_period_us",
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH RESEND v8 2/2] sched/numa: documentation for per-cgroup numa
  2020-02-07  3:34 [PATCH RESEND v8 0/2] sched/numa: introduce numa locality 王贇
  2020-02-07  3:35 ` [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
@ 2020-02-07  3:35 ` 王贇
  1 sibling, 0 replies; 16+ messages in thread
From: 王贇 @ 2020-02-07  3:35 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Add the description for 'numa_locality', also a new doc to explain
the details on how to deal with the per-cgroup numa statistics.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Iurii Zaikin <yzaikin@google.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
---
 Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
 Documentation/admin-guide/index.rst             |   1 +
 Documentation/admin-guide/kernel-parameters.txt |   4 +
 Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
 init/Kconfig                                    |   2 +
 5 files changed, 194 insertions(+)
 create mode 100644 Documentation/admin-guide/cg-numa-stat.rst

diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
new file mode 100644
index 000000000000..1106eb1e4050
--- /dev/null
+++ b/Documentation/admin-guide/cg-numa-stat.rst
@@ -0,0 +1,178 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============================
+Per-cgroup NUMA statistics
+===============================
+
+Background
+----------
+
+On NUMA platforms, remote memory accessing always has a performance penalty.
+Although we have NUMA balancing working hard to maximize the access locality,
+there are still situations it can't help.
+
+This could happen in modern production environment. When a large number of
+cgroups are used to classify and control resources, this creates a complex
+configuration for memory policy, CPUs and NUMA nodes. In such cases NUMA
+balancing could end up with the wrong memory policy or exhausted local NUMA
+node, which would lead to low percentage of local page accesses.
+
+We need to detect such cases, figure out which workloads from which cgroup
+have introduced the issues, then we get chance to do adjustment to avoid
+performance degradation.
+
+However, there are no hardware counters for per-task local/remote accessing
+info, we don't know how many remote page accesses have occurred for a
+particular task.
+
+NUMA Locality
+-------------
+
+Fortunately, we have NUMA Balancing which scans task's mapping and triggers
+page fault periodically, giving us the opportunity to record per-task page
+accessing info, when the CPU fall into PF is from the same node of pages, we
+consider task as doing local page accessing, otherwise the remote page
+accessing, we call these two counters the locality info.
+
+On each tick, we acquire the locality info of current task on that CPU, update
+the increments into its cgroup, becoming the group locality info.
+
+By "echo 1 > /proc/sys/kernel/numa_locality" at runtime or adding boot parameter
+'numa_locality', we will enable the accounting of per-cgroup NUMA locality info,
+the 'cpu.numa_stat' entry of CPU cgroup will show statistics::
+
+  page_access local=NR_LOCAL_PAGE_ACCESS remote=NR_REMOTE_PAGE_ACCESS
+
+We define 'NUMA locality' as::
+
+  NR_LOCAL_PAGE_ACCESS * 100 / (NR_LOCAL_PAGE_ACCESS + NR_REMOTE_PAGE_ACCESS)
+
+This per-cgroup percentage number helps to represent the NUMA Balancing behavior.
+
+Note that the accounting is hierarchical, which means the NUMA locality info for
+a given group represents not only the workload of this group, but also the
+workloads of all its descendants.
+
+For example the 'cpu.numa_stat' shows::
+
+  page_access local=129909383 remote=18265810
+
+The NUMA locality calculated as::
+
+  129909383 * 100 / (129909383 + 18265810) = 87.67
+
+Thus we know the workload of this group and its descendants have totally done
+129909383 times of local page accessing and 18265810 times of remotes, locality
+is 87.67% which implies most of the memory access are local.
+
+NUMA Consumption
+----------------
+
+There are also other cgroup entries which help us to estimate NUMA efficiency.
+They are 'cpuacct.usage_percpu' and 'memory.numa_stat'.
+
+By reading 'cpuacct.usage_percpu' we will get per-cpu runtime (in nanoseconds)
+info (in hierarchy) as::
+
+  CPU_0_RUNTIME CPU_1_RUNTIME CPU_2_RUNTIME ... CPU_X_RUNTIME
+
+Combined with the info from::
+
+  cat /sys/devices/system/node/nodeX/cpulist
+
+We would be able to accumulate the runtime of CPUs into NUMA nodes, to get the
+per-cgroup node runtime info.
+
+By reading 'memory.numa_stat' we will get per-cgroup node memory consumption
+info as::
+
+  total=TOTAL_MEM N0=MEM_ON_NODE0 N1=MEM_ON_NODE1 ... NX=MEM_ON_NODEX
+
+Together we call these the per-cgroup NUMA consumption info, telling us how many
+resources a particular workload has consumed, on a particular NUMA node.
+
+Monitoring
+----------
+
+By monitoring the increments of locality info, we can easily know whether NUMA
+Balancing is working well for a particular workload.
+
+For example we take a 5 seconds sample period, then on each sampling we have::
+
+  local_diff = last_nr_local_page_access - nr_local_page_access
+  remote_diff = last_nr_remote_page_access - nr_remote_page_access
+
+and we get the locality in this period as::
+
+  locality = local_diff * 100 / (local_diff + remote_diff)
+
+We can plot a line for locality. When the line is close to 100%, things are
+good; when getting close to 0% something is wrong. We can pick a proper
+watermark to trigger warning message.
+
+You may want to drop the data if the local/remote_diff is too small, which
+implies there are not many available pages for NUMA Balancing to scan, ignoring
+would be fine since most likely the workload is insensitive to NUMA, or the
+memory topology is already good enough.
+
+Monitoring root group helps you control the overall situation, while you may
+also want to monitor all the leaf groups which contain the workloads, this
+helps to catch the mouse.
+
+Try to put your workload into also the cpuacct & memory cgroup, when NUMA
+Balancing is disabled or locality becomes too small, we may want to monitor
+the per-node runtime & memory info to see if the node consumption meet the
+requirements.
+
+For NUMA node X on each sampling we have::
+
+  runtime_X_diff = runtime_X - last_runtime_X
+  runtime_all_diff = runtime_all - last_runtime_all
+
+  runtime_percent_X = runtime_X_diff * 100 / runtime_all_diff
+  memory_percent_X = memory_X * 100 / memory_all
+
+These two percentages are usually matched on each node, workload should execute
+mostly on the node that contains most of its memory, but it's not guaranteed.
+
+The workload may only access a small part of its memory, in such cases although
+the majority of memory are remote, locality could still be good.
+
+Thus to tell if things are fine or not depends on the understanding of system
+resource deployment, however, if you find node X got 100% memory percent but 0%
+runtime percent, definitely something is wrong.
+
+Troubleshooting
+---------------
+
+After identifying which workload introduced the bad locality, check:
+
+1). Is the workload bound to a particular NUMA node?
+2). Has any NUMA node run out of resources?
+
+There are several ways to bind task's memory with a NUMA node, the strict way
+like the MPOL_BIND memory policy or 'cpuset.mems' will limit the memory
+node where to allocate pages. In this situation, admin should make sure the
+task is allowed to run on the CPUs of that NUMA node, and make sure there are
+available CPU resources there.
+
+There are also ways to bind task's CPU with a NUMA node, like 'cpuset.cpus' or
+sched_setaffinity() syscall. In this situation, NUMA Balancing helps to migrate
+pages into that node, admin should make sure there is available memory there.
+
+Admin could try to rebind or unbind the NUMA node to erase the damage, make a
+change then observe the statistics to see if things get better until the
+situation is acceptable.
+
+Highlights
+----------
+
+For some tasks, NUMA Balancing may be found to be unnecessary to scan pages,
+and locality could always be 0 or small number, don't pay attention to them
+since they most likely insensitive to NUMA.
+
+There is no accounting until the option is turned on, so enable it in advance
+if you want to have the whole history.
+
+We have per-task migfailed counter to tell how many page migrations have
+failed for a particular task; you will find it in /proc/PID/sched entry.
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index f1d0ccffbe72..bd769f5ba565 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -114,6 +114,7 @@ configure specific aspects of kernel behavior to your liking.
    video-output
    wimax/index
    xfs
+   cg-numa-stat

 .. only::  subproject and html

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e35b28e3a301..9024fc1bed8d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3249,6 +3249,10 @@
 	numa_balancing=	[KNL,X86] Enable or disable automatic NUMA balancing.
 			Allowed values are enable and disable

+	numa_locality	[KNL] Enable per-cgroup numa locality info.
+			Useful to debug NUMA efficiency problems when there are
+			lots of per-cgroup workloads.
+
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
 			'node', 'default' can be specified
 			This can be set from sysctl after boot.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index def074807cee..d2b862c65e67 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -556,6 +556,15 @@ rate for each task.
 numa_balancing_scan_size_mb is how many megabytes worth of pages are
 scanned for a given scan.

+numa_locality:
+=============
+
+Enables/disables per-cgroup NUMA locality info.
+
+0: disabled (default).
+1: enabled.
+
+Check Documentation/admin-guide/cg-numa-stat.rst for details.

 osrelease, ostype & version:
 ============================
diff --git a/init/Kconfig b/init/Kconfig
index 63c6b90a515d..2b3281caab42 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -821,6 +821,8 @@ config CGROUP_NUMA_LOCALITY
 	  This option enables the collection of per-cgroup NUMA locality info,
 	  to tell whether NUMA Balancing is working well for a particular
 	  workload, also imply the NUMA efficiency.
+	  See
+		-  Documentation/admin-guide/cg-numa-stat.rst

 menuconfig CGROUPS
 	bool "Control Group support"
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-07  3:35 ` [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
@ 2020-02-07  3:37   ` 王贇
  2020-02-13  2:35     ` 王贇
  2020-02-14 15:10   ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: 王贇 @ 2020-02-07  3:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Forwarded:

Hi, Peter, Ingo

Could you give some comments please?

As Mel replied previously, he won't disagree the idea, so we're looking
forward the opinion from the maintainers.

Please allow me to highlight the necessary of monitoring NUMA Balancing
again, this feature is critical to the performance on NUMA platform,
it cost and benefit -- lot or less, however there are not enough
information for an admin to analysis the trade-off, while locality could
be the missing piece.

Regards,
Michael Wang

On 2020/2/7 上午11:35, 王贇 wrote:
> Currently there are no good approach to monitoring the per-cgroup NUMA
> efficiency, this could be a trouble especially when groups are sharing
> CPUs, we don't know which one introduced remote-memory accessing.
> 
> Although the per-task NUMA accessing info from PMU is good for further
> debuging, but not light enough for daily monitoring, especial on a box
> with thousands of tasks.
> 
> Fortunately, when NUMA Balancing enabled, it will periodly trigger page
> fault and try to increase the NUMA locality, by tracing the results we
> will be able to estimate the NUMA efficiency.
> 
> On each page fault of NUMA Balancing, when task's executing CPU is from
> the same node of pages, we call this a local page accessing, otherwise
> a remote page accessing.
> 
> By updating task's accessing counter into it's cgroup on ticks, we get
> the per-cgroup numa locality info.
> 
> For example the new entry 'cpu.numa_stat' show:
>   page_access local=1231412 remote=53453
> 
> Here we know the workloads in hierarchy have totally been traced 1284865
> times of page accessing, and 1231412 of them are local page access, which
> imply a good NUMA efficiency.
> 
> By monitoring the increments, we will be able to locate the per-cgroup
> workload which NUMA Balancing can't helpwith (usually caused by wrong
> CPU and memory node bindings), then we got chance to fix that in time.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> ---
>  include/linux/sched.h        | 15 +++++++++
>  include/linux/sched/sysctl.h |  6 ++++
>  init/Kconfig                 |  9 ++++++
>  kernel/sched/core.c          | 75 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         | 12 +++++++
>  kernel/sysctl.c              | 11 +++++++
>  7 files changed, 190 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a6c924fa1c77..74bf234bae53 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1128,6 +1128,21 @@ struct task_struct {
>  	unsigned long			numa_pages_migrated;
>  #endif /* CONFIG_NUMA_BALANCING */
> 
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +	/*
> +	 * Counter index stand for:
> +	 * 0 -- remote page accessing
> +	 * 1 -- local page accessing
> +	 * 2 -- remote page accessing updated to cgroup
> +	 * 3 -- local page accessing updated to cgroup
> +	 *
> +	 * We record the counter before the end of task_numa_fault(), this
> +	 * is based on the fact that after page fault is handled, the task
> +	 * will access the page on the CPU where it triggered the PF.
> +	 */
> +	unsigned long			numa_page_access[4];
> +#endif
> +
>  #ifdef CONFIG_RSEQ
>  	struct rseq __user *rseq;
>  	u32 rseq_sig;
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index d4f6215ee03f..bb3721cf48e0 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -101,4 +101,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
>  				 loff_t *ppos);
>  #endif
> 
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +extern int sysctl_numa_locality(struct ctl_table *table, int write,
> +				 void __user *buffer, size_t *lenp,
> +				 loff_t *ppos);
> +#endif
> +
>  #endif /* _LINUX_SCHED_SYSCTL_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 322fd2c65304..63c6b90a515d 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -813,6 +813,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>  	  machine.
> 
> +config CGROUP_NUMA_LOCALITY
> +	bool "per-cgroup NUMA Locality"
> +	default n
> +	depends on CGROUP_SCHED && NUMA_BALANCING
> +	help
> +	  This option enables the collection of per-cgroup NUMA locality info,
> +	  to tell whether NUMA Balancing is working well for a particular
> +	  workload, also imply the NUMA efficiency.
> +
>  menuconfig CGROUPS
>  	bool "Control Group support"
>  	select KERNFS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e7b08d52db93..40dd6b221eef 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7657,6 +7657,68 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
> 
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +DEFINE_STATIC_KEY_FALSE(sched_numa_locality);
> +
> +#ifdef CONFIG_PROC_SYSCTL
> +int sysctl_numa_locality(struct ctl_table *table, int write,
> +			 void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct ctl_table t;
> +	int err;
> +	int state = static_branch_likely(&sched_numa_locality);
> +
> +	if (write && !capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	t = *table;
> +	t.data = &state;
> +	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
> +	if (err < 0 || !write)
> +		return err;
> +
> +	if (state)
> +		static_branch_enable(&sched_numa_locality);
> +	else
> +		static_branch_disable(&sched_numa_locality);
> +
> +	return err;
> +}
> +#endif
> +
> +static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
> +{
> +	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
> +}
> +
> +static int cpu_numa_stat_show(struct seq_file *sf, void *v)
> +{
> +	int cpu;
> +	u64 local = 0, remote = 0;
> +	struct task_group *tg = css_tg(seq_css(sf));
> +
> +	if (!static_branch_likely(&sched_numa_locality))
> +		return 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		local += tg_cfs_rq(tg, cpu)->local_page_access;
> +		remote += tg_cfs_rq(tg, cpu)->remote_page_access;
> +	}
> +
> +	seq_printf(sf, "page_access local=%llu remote=%llu\n", local, remote);
> +
> +	return 0;
> +}
> +
> +static __init int numa_locality_setup(char *opt)
> +{
> +	static_branch_enable(&sched_numa_locality);
> +
> +	return 0;
> +}
> +__setup("numa_locality", numa_locality_setup);
> +#endif
> +
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	{
> @@ -7706,6 +7768,12 @@ static struct cftype cpu_legacy_files[] = {
>  		.seq_show = cpu_uclamp_max_show,
>  		.write = cpu_uclamp_max_write,
>  	},
> +#endif
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +	{
> +		.name = "numa_stat",
> +		.seq_show = cpu_numa_stat_show,
> +	},
>  #endif
>  	{ }	/* Terminate */
>  };
> @@ -7887,6 +7955,13 @@ static struct cftype cpu_files[] = {
>  		.seq_show = cpu_uclamp_max_show,
>  		.write = cpu_uclamp_max_write,
>  	},
> +#endif
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +	{
> +		.name = "numa_stat",
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +		.seq_show = cpu_numa_stat_show,
> +	},
>  #endif
>  	{ }	/* terminate */
>  };
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d170b5da0e3..eb838557bae2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1049,7 +1049,63 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   * Scheduling class queueing methods:
>   */
> 
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +/*
> + * We want to record the real local/remote page access statistic
> + * here, so 'pnid' should be pages's real residential node after
> + * migrate_misplaced_page(), and 'cnid' should be the node of CPU
> + * where triggered the PF.
> + */
> +static inline void
> +update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
> +{
> +	if (!static_branch_unlikely(&sched_numa_locality))
> +		return;
> +
> +	/*
> +	 * pnid != cnid --> remote idx 0
> +	 * pnid == cnid --> local idx 1
> +	 */
> +	p->numa_page_access[!!(pnid == cnid)] += pages;
> +}
> +
> +static inline void update_group_locality(struct cfs_rq *cfs_rq)
> +{
> +	unsigned long ldiff, rdiff;
> +
> +	if (!static_branch_unlikely(&sched_numa_locality))
> +		return;
> +
> +	rdiff = current->numa_page_access[0] - current->numa_page_access[2];
> +	ldiff = current->numa_page_access[1] - current->numa_page_access[3];
> +	if (!ldiff && !rdiff)
> +		return;
> +
> +	cfs_rq->local_page_access += ldiff;
> +	cfs_rq->remote_page_access += rdiff;
> +
> +	/*
> +	 * Consider updated when reach root cfs_rq, no NUMA Balancing PF
> +	 * should happen on current task during the hierarchical updating.
> +	 */
> +	if (&cfs_rq->rq->cfs == cfs_rq) {
> +		current->numa_page_access[2] = current->numa_page_access[0];
> +		current->numa_page_access[3] = current->numa_page_access[1];
> +	}
> +}
> +#else
> +static inline void
> +update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
> +{
> +}
> +
> +static inline void update_group_locality(struct cfs_rq *cfs_rq)
> +{
> +}
> +#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
> +
>  #ifdef CONFIG_NUMA_BALANCING
> +
>  /*
>   * Approximate time to scan a full NUMA task in ms. The task scan period is
>   * calculated based on the tasks virtual memory size and
> @@ -2465,6 +2521,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>  	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
>  	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
>  	p->numa_faults_locality[local] += pages;
> +
> +	update_task_locality(p, mem_node, numa_node_id(), pages);
>  }
> 
>  static void reset_ptenuma_scan(struct task_struct *p)
> @@ -2650,6 +2708,9 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>  	p->last_sum_exec_runtime	= 0;
> 
>  	init_task_work(&p->numa_work, task_numa_work);
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +	memset(p->numa_page_access, 0, sizeof(p->numa_page_access));
> +#endif
> 
>  	/* New address space, reset the preferred nid */
>  	if (!(clone_flags & CLONE_VM)) {
> @@ -4313,6 +4374,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	 */
>  	update_load_avg(cfs_rq, curr, UPDATE_TG);
>  	update_cfs_group(curr);
> +	update_group_locality(cfs_rq);
> 
>  #ifdef CONFIG_SCHED_HRTICK
>  	/*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1a88dc8ad11b..66b4e581b6ed 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -575,6 +575,14 @@ struct cfs_rq {
>  	struct list_head	throttled_list;
>  #endif /* CONFIG_CFS_BANDWIDTH */
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +	/*
> +	 * The local/remote page access info collected from all
> +	 * the tasks in hierarchy.
> +	 */
> +	u64			local_page_access;
> +	u64			remote_page_access;
> +#endif
>  };
> 
>  static inline int rt_bandwidth_enabled(void)
> @@ -1601,6 +1609,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
>  extern struct static_key_false sched_numa_balancing;
>  extern struct static_key_false sched_schedstats;
> 
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +extern struct static_key_false sched_numa_locality;
> +#endif
> +
>  static inline u64 global_rt_period(void)
>  {
>  	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index d396aaaf19a3..a8f5951f92b3 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
>  		.extra2		= SYSCTL_ONE,
>  	},
>  #endif /* CONFIG_NUMA_BALANCING */
> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
> +	{
> +		.procname	= "numa_locality",
> +		.data		= NULL, /* filled in by handler */
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= sysctl_numa_locality,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},
> +#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
>  #endif /* CONFIG_SCHED_DEBUG */
>  	{
>  		.procname	= "sched_rt_period_us",
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-07  3:37   ` 王贇
@ 2020-02-13  2:35     ` 王贇
  0 siblings, 0 replies; 16+ messages in thread
From: 王贇 @ 2020-02-13  2:35 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Luis Chamberlain, Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

Hi, Peter

Could you please give some comments on this one?

Regards,
Michael Wang

On 2020/2/7 上午11:37, 王贇 wrote:
> Forwarded:
> 
> Hi, Peter, Ingo
> 
> Could you give some comments please?
> 
> As Mel replied previously, he won't disagree the idea, so we're looking
> forward the opinion from the maintainers.
> 
> Please allow me to highlight the necessary of monitoring NUMA Balancing
> again, this feature is critical to the performance on NUMA platform,
> it cost and benefit -- lot or less, however there are not enough
> information for an admin to analysis the trade-off, while locality could
> be the missing piece.
> 
> Regards,
> Michael Wang
> 
> On 2020/2/7 上午11:35, 王贇 wrote:
>> Currently there are no good approach to monitoring the per-cgroup NUMA
>> efficiency, this could be a trouble especially when groups are sharing
>> CPUs, we don't know which one introduced remote-memory accessing.
>>
>> Although the per-task NUMA accessing info from PMU is good for further
>> debuging, but not light enough for daily monitoring, especial on a box
>> with thousands of tasks.
>>
>> Fortunately, when NUMA Balancing enabled, it will periodly trigger page
>> fault and try to increase the NUMA locality, by tracing the results we
>> will be able to estimate the NUMA efficiency.
>>
>> On each page fault of NUMA Balancing, when task's executing CPU is from
>> the same node of pages, we call this a local page accessing, otherwise
>> a remote page accessing.
>>
>> By updating task's accessing counter into it's cgroup on ticks, we get
>> the per-cgroup numa locality info.
>>
>> For example the new entry 'cpu.numa_stat' show:
>>   page_access local=1231412 remote=53453
>>
>> Here we know the workloads in hierarchy have totally been traced 1284865
>> times of page accessing, and 1231412 of them are local page access, which
>> imply a good NUMA efficiency.
>>
>> By monitoring the increments, we will be able to locate the per-cgroup
>> workload which NUMA Balancing can't helpwith (usually caused by wrong
>> CPU and memory node bindings), then we got chance to fix that in time.
>>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Michal Koutný <mkoutny@suse.com>
>> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
>> ---
>>  include/linux/sched.h        | 15 +++++++++
>>  include/linux/sched/sysctl.h |  6 ++++
>>  init/Kconfig                 |  9 ++++++
>>  kernel/sched/core.c          | 75 ++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sched/fair.c          | 62 ++++++++++++++++++++++++++++++++++++
>>  kernel/sched/sched.h         | 12 +++++++
>>  kernel/sysctl.c              | 11 +++++++
>>  7 files changed, 190 insertions(+)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index a6c924fa1c77..74bf234bae53 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1128,6 +1128,21 @@ struct task_struct {
>>  	unsigned long			numa_pages_migrated;
>>  #endif /* CONFIG_NUMA_BALANCING */
>>
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +	/*
>> +	 * Counter index stand for:
>> +	 * 0 -- remote page accessing
>> +	 * 1 -- local page accessing
>> +	 * 2 -- remote page accessing updated to cgroup
>> +	 * 3 -- local page accessing updated to cgroup
>> +	 *
>> +	 * We record the counter before the end of task_numa_fault(), this
>> +	 * is based on the fact that after page fault is handled, the task
>> +	 * will access the page on the CPU where it triggered the PF.
>> +	 */
>> +	unsigned long			numa_page_access[4];
>> +#endif
>> +
>>  #ifdef CONFIG_RSEQ
>>  	struct rseq __user *rseq;
>>  	u32 rseq_sig;
>> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
>> index d4f6215ee03f..bb3721cf48e0 100644
>> --- a/include/linux/sched/sysctl.h
>> +++ b/include/linux/sched/sysctl.h
>> @@ -101,4 +101,10 @@ extern int sched_energy_aware_handler(struct ctl_table *table, int write,
>>  				 loff_t *ppos);
>>  #endif
>>
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +extern int sysctl_numa_locality(struct ctl_table *table, int write,
>> +				 void __user *buffer, size_t *lenp,
>> +				 loff_t *ppos);
>> +#endif
>> +
>>  #endif /* _LINUX_SCHED_SYSCTL_H */
>> diff --git a/init/Kconfig b/init/Kconfig
>> index 322fd2c65304..63c6b90a515d 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -813,6 +813,15 @@ config NUMA_BALANCING_DEFAULT_ENABLED
>>  	  If set, automatic NUMA balancing will be enabled if running on a NUMA
>>  	  machine.
>>
>> +config CGROUP_NUMA_LOCALITY
>> +	bool "per-cgroup NUMA Locality"
>> +	default n
>> +	depends on CGROUP_SCHED && NUMA_BALANCING
>> +	help
>> +	  This option enables the collection of per-cgroup NUMA locality info,
>> +	  to tell whether NUMA Balancing is working well for a particular
>> +	  workload, also imply the NUMA efficiency.
>> +
>>  menuconfig CGROUPS
>>  	bool "Control Group support"
>>  	select KERNFS
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index e7b08d52db93..40dd6b221eef 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7657,6 +7657,68 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
>>  }
>>  #endif /* CONFIG_RT_GROUP_SCHED */
>>
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +DEFINE_STATIC_KEY_FALSE(sched_numa_locality);
>> +
>> +#ifdef CONFIG_PROC_SYSCTL
>> +int sysctl_numa_locality(struct ctl_table *table, int write,
>> +			 void __user *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> +	struct ctl_table t;
>> +	int err;
>> +	int state = static_branch_likely(&sched_numa_locality);
>> +
>> +	if (write && !capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	t = *table;
>> +	t.data = &state;
>> +	err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
>> +	if (err < 0 || !write)
>> +		return err;
>> +
>> +	if (state)
>> +		static_branch_enable(&sched_numa_locality);
>> +	else
>> +		static_branch_disable(&sched_numa_locality);
>> +
>> +	return err;
>> +}
>> +#endif
>> +
>> +static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
>> +{
>> +	return tg == &root_task_group ? &cpu_rq(cpu)->cfs : tg->cfs_rq[cpu];
>> +}
>> +
>> +static int cpu_numa_stat_show(struct seq_file *sf, void *v)
>> +{
>> +	int cpu;
>> +	u64 local = 0, remote = 0;
>> +	struct task_group *tg = css_tg(seq_css(sf));
>> +
>> +	if (!static_branch_likely(&sched_numa_locality))
>> +		return 0;
>> +
>> +	for_each_possible_cpu(cpu) {
>> +		local += tg_cfs_rq(tg, cpu)->local_page_access;
>> +		remote += tg_cfs_rq(tg, cpu)->remote_page_access;
>> +	}
>> +
>> +	seq_printf(sf, "page_access local=%llu remote=%llu\n", local, remote);
>> +
>> +	return 0;
>> +}
>> +
>> +static __init int numa_locality_setup(char *opt)
>> +{
>> +	static_branch_enable(&sched_numa_locality);
>> +
>> +	return 0;
>> +}
>> +__setup("numa_locality", numa_locality_setup);
>> +#endif
>> +
>>  static struct cftype cpu_legacy_files[] = {
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  	{
>> @@ -7706,6 +7768,12 @@ static struct cftype cpu_legacy_files[] = {
>>  		.seq_show = cpu_uclamp_max_show,
>>  		.write = cpu_uclamp_max_write,
>>  	},
>> +#endif
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +	{
>> +		.name = "numa_stat",
>> +		.seq_show = cpu_numa_stat_show,
>> +	},
>>  #endif
>>  	{ }	/* Terminate */
>>  };
>> @@ -7887,6 +7955,13 @@ static struct cftype cpu_files[] = {
>>  		.seq_show = cpu_uclamp_max_show,
>>  		.write = cpu_uclamp_max_write,
>>  	},
>> +#endif
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +	{
>> +		.name = "numa_stat",
>> +		.flags = CFTYPE_NOT_ON_ROOT,
>> +		.seq_show = cpu_numa_stat_show,
>> +	},
>>  #endif
>>  	{ }	/* terminate */
>>  };
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2d170b5da0e3..eb838557bae2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1049,7 +1049,63 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>   * Scheduling class queueing methods:
>>   */
>>
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +/*
>> + * We want to record the real local/remote page access statistic
>> + * here, so 'pnid' should be pages's real residential node after
>> + * migrate_misplaced_page(), and 'cnid' should be the node of CPU
>> + * where triggered the PF.
>> + */
>> +static inline void
>> +update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
>> +{
>> +	if (!static_branch_unlikely(&sched_numa_locality))
>> +		return;
>> +
>> +	/*
>> +	 * pnid != cnid --> remote idx 0
>> +	 * pnid == cnid --> local idx 1
>> +	 */
>> +	p->numa_page_access[!!(pnid == cnid)] += pages;
>> +}
>> +
>> +static inline void update_group_locality(struct cfs_rq *cfs_rq)
>> +{
>> +	unsigned long ldiff, rdiff;
>> +
>> +	if (!static_branch_unlikely(&sched_numa_locality))
>> +		return;
>> +
>> +	rdiff = current->numa_page_access[0] - current->numa_page_access[2];
>> +	ldiff = current->numa_page_access[1] - current->numa_page_access[3];
>> +	if (!ldiff && !rdiff)
>> +		return;
>> +
>> +	cfs_rq->local_page_access += ldiff;
>> +	cfs_rq->remote_page_access += rdiff;
>> +
>> +	/*
>> +	 * Consider updated when reach root cfs_rq, no NUMA Balancing PF
>> +	 * should happen on current task during the hierarchical updating.
>> +	 */
>> +	if (&cfs_rq->rq->cfs == cfs_rq) {
>> +		current->numa_page_access[2] = current->numa_page_access[0];
>> +		current->numa_page_access[3] = current->numa_page_access[1];
>> +	}
>> +}
>> +#else
>> +static inline void
>> +update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
>> +{
>> +}
>> +
>> +static inline void update_group_locality(struct cfs_rq *cfs_rq)
>> +{
>> +}
>> +#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
>> +
>>  #ifdef CONFIG_NUMA_BALANCING
>> +
>>  /*
>>   * Approximate time to scan a full NUMA task in ms. The task scan period is
>>   * calculated based on the tasks virtual memory size and
>> @@ -2465,6 +2521,8 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
>>  	p->numa_faults[task_faults_idx(NUMA_MEMBUF, mem_node, priv)] += pages;
>>  	p->numa_faults[task_faults_idx(NUMA_CPUBUF, cpu_node, priv)] += pages;
>>  	p->numa_faults_locality[local] += pages;
>> +
>> +	update_task_locality(p, mem_node, numa_node_id(), pages);
>>  }
>>
>>  static void reset_ptenuma_scan(struct task_struct *p)
>> @@ -2650,6 +2708,9 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
>>  	p->last_sum_exec_runtime	= 0;
>>
>>  	init_task_work(&p->numa_work, task_numa_work);
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +	memset(p->numa_page_access, 0, sizeof(p->numa_page_access));
>> +#endif
>>
>>  	/* New address space, reset the preferred nid */
>>  	if (!(clone_flags & CLONE_VM)) {
>> @@ -4313,6 +4374,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>  	 */
>>  	update_load_avg(cfs_rq, curr, UPDATE_TG);
>>  	update_cfs_group(curr);
>> +	update_group_locality(cfs_rq);
>>
>>  #ifdef CONFIG_SCHED_HRTICK
>>  	/*
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 1a88dc8ad11b..66b4e581b6ed 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -575,6 +575,14 @@ struct cfs_rq {
>>  	struct list_head	throttled_list;
>>  #endif /* CONFIG_CFS_BANDWIDTH */
>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +	/*
>> +	 * The local/remote page access info collected from all
>> +	 * the tasks in hierarchy.
>> +	 */
>> +	u64			local_page_access;
>> +	u64			remote_page_access;
>> +#endif
>>  };
>>
>>  static inline int rt_bandwidth_enabled(void)
>> @@ -1601,6 +1609,10 @@ static const_debug __maybe_unused unsigned int sysctl_sched_features =
>>  extern struct static_key_false sched_numa_balancing;
>>  extern struct static_key_false sched_schedstats;
>>
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +extern struct static_key_false sched_numa_locality;
>> +#endif
>> +
>>  static inline u64 global_rt_period(void)
>>  {
>>  	return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>> index d396aaaf19a3..a8f5951f92b3 100644
>> --- a/kernel/sysctl.c
>> +++ b/kernel/sysctl.c
>> @@ -428,6 +428,17 @@ static struct ctl_table kern_table[] = {
>>  		.extra2		= SYSCTL_ONE,
>>  	},
>>  #endif /* CONFIG_NUMA_BALANCING */
>> +#ifdef CONFIG_CGROUP_NUMA_LOCALITY
>> +	{
>> +		.procname	= "numa_locality",
>> +		.data		= NULL, /* filled in by handler */
>> +		.maxlen		= sizeof(unsigned int),
>> +		.mode		= 0644,
>> +		.proc_handler	= sysctl_numa_locality,
>> +		.extra1		= SYSCTL_ZERO,
>> +		.extra2		= SYSCTL_ONE,
>> +	},
>> +#endif /* CONFIG_CGROUP_NUMA_LOCALITY */
>>  #endif /* CONFIG_SCHED_DEBUG */
>>  	{
>>  		.procname	= "sched_rt_period_us",
>>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-07  3:35 ` [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
  2020-02-07  3:37   ` 王贇
@ 2020-02-14 15:10   ` Peter Zijlstra
  2020-02-17 11:58     ` Mel Gorman
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2020-02-14 15:10 UTC (permalink / raw)
  To: 王贇
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutný,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

On Fri, Feb 07, 2020 at 11:35:30AM +0800, 王贇 wrote:
> Currently there are no good approach to monitoring the per-cgroup NUMA
> efficiency, this could be a trouble especially when groups are sharing
> CPUs, we don't know which one introduced remote-memory accessing.
> 
> Although the per-task NUMA accessing info from PMU is good for further
> debuging, but not light enough for daily monitoring, especial on a box
> with thousands of tasks.
> 
> Fortunately, when NUMA Balancing enabled, it will periodly trigger page
> fault and try to increase the NUMA locality, by tracing the results we
> will be able to estimate the NUMA efficiency.
> 
> On each page fault of NUMA Balancing, when task's executing CPU is from
> the same node of pages, we call this a local page accessing, otherwise
> a remote page accessing.
> 
> By updating task's accessing counter into it's cgroup on ticks, we get
> the per-cgroup numa locality info.
> 
> For example the new entry 'cpu.numa_stat' show:
>   page_access local=1231412 remote=53453
> 
> Here we know the workloads in hierarchy have totally been traced 1284865
> times of page accessing, and 1231412 of them are local page access, which
> imply a good NUMA efficiency.
> 
> By monitoring the increments, we will be able to locate the per-cgroup
> workload which NUMA Balancing can't helpwith (usually caused by wrong
> CPU and memory node bindings), then we got chance to fix that in time.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>

So here:

  https://lkml.kernel.org/r/20191127101932.GN28938@suse.de

Mel argues that the information exposed is fairly implementation
specific and hard to use without understanding how NUMA balancing works.

By exposing it to userspace, we tie ourselves to these particulars. We
can no longer change these NUMA balancing details if we wanted to, due
to UAPI concerns.

Mel, I suspect you still feel that way, right?

In the document (patch 2/2) you write:

> +However, there are no hardware counters for per-task local/remote accessing
> +info, we don't know how many remote page accesses have occurred for a
> +particular task.

We can of course 'fix' that by adding a tracepoint.

Mel, would you feel better by having a tracepoint in task_numa_fault() ?

Now I'm not really a fan of tracepoints myself, since they also
establish a UAPI, but perhaps it is a lesser evil in this case.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-14 15:10   ` Peter Zijlstra
@ 2020-02-17 11:58     ` Mel Gorman
  2020-02-17 13:23       ` 王贇
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2020-02-17 11:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: ??????,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

On Fri, Feb 14, 2020 at 04:10:48PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2020 at 11:35:30AM +0800, ?????? wrote:
> > By monitoring the increments, we will be able to locate the per-cgroup
> > workload which NUMA Balancing can't helpwith (usually caused by wrong
> > CPU and memory node bindings), then we got chance to fix that in time.
> > 
> > Cc: Mel Gorman <mgorman@suse.de>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Michal Koutný <mkoutny@suse.com>
> > Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
> 
> So here:
> 
>   https://lkml.kernel.org/r/20191127101932.GN28938@suse.de
> 
> Mel argues that the information exposed is fairly implementation
> specific and hard to use without understanding how NUMA balancing works.
> 
> By exposing it to userspace, we tie ourselves to these particulars. We
> can no longer change these NUMA balancing details if we wanted to, due
> to UAPI concerns.
> 
> Mel, I suspect you still feel that way, right?
> 

Yes, I still think it would be a struggle to interpret the data
meaningfully without very specific knowledge of the implementation. If
the scan rate was constant, it would be easier but that would make NUMA
balancing worse overall. Similarly, the stat might get very difficult to
interpret when NUMA balancing is failing because of a load imbalance,
pages are shared and being interleaved or NUMA groups span multiple
active nodes.

For example, the series that reconciles NUMA and CPU balancers may look
worse in these stats even though the overall performance may be better.

> In the document (patch 2/2) you write:
> 
> > +However, there are no hardware counters for per-task local/remote accessing
> > +info, we don't know how many remote page accesses have occurred for a
> > +particular task.
> 
> We can of course 'fix' that by adding a tracepoint.
> 
> Mel, would you feel better by having a tracepoint in task_numa_fault() ?
> 

A bit, although interpreting the data would still be difficult and the
tracepoint would have to include information about the cgroup. While
I've never tried, this seems like the type of thing that would be suited
to a BPF script that probes task_numa_fault and extract the information
it needs.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-17 11:58     ` Mel Gorman
@ 2020-02-17 13:23       ` 王贇
  2020-02-17 14:16         ` Mel Gorman
  2020-02-21 15:28         ` Peter Zijlstra
  0 siblings, 2 replies; 16+ messages in thread
From: 王贇 @ 2020-02-17 13:23 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet



On 2020/2/17 下午7:58, Mel Gorman wrote:
[snip]
>> Mel, I suspect you still feel that way, right?
>>
> 
> Yes, I still think it would be a struggle to interpret the data
> meaningfully without very specific knowledge of the implementation. If
> the scan rate was constant, it would be easier but that would make NUMA
> balancing worse overall. Similarly, the stat might get very difficult to
> interpret when NUMA balancing is failing because of a load imbalance,
> pages are shared and being interleaved or NUMA groups span multiple
> active nodes.

Hi, Mel, appreciated to have you back on the table :-)

IMHO the scan period changing should not be a problem now, since the
maximum period is defined by user, so monitoring at maximum period
on the accumulated page accessing counters is always meaningful, correct?

FYI, by monitoring locality, we found that the kvm vcpu thread is not
covered by NUMA Balancing, whatever how many maximum period passed, the
counters are not increasing, or very slowly, although inside guest we are
copying memory.

Later we found such task rarely exit to user space to trigger task
work callbacks, and NUMA Balancing scan depends on that, which help us
realize the importance to enable NUMA Balancing inside guest, with the
correct NUMA topo, a big performance risk I'll say :-P

Maybe not a good example, but we just try to highlight that NUMA Balancing
could have issue in some cases, and we want them to be exposed, somehow,
maybe by the locality.

Regards,
Michael Wang

> 
> For example, the series that reconciles NUMA and CPU balancers may look
> worse in these stats even though the overall performance may be better.
> 
>> In the document (patch 2/2) you write:
>>
>>> +However, there are no hardware counters for per-task local/remote accessing
>>> +info, we don't know how many remote page accesses have occurred for a
>>> +particular task.
>>
>> We can of course 'fix' that by adding a tracepoint.
>>
>> Mel, would you feel better by having a tracepoint in task_numa_fault() ?
>>
> 
> A bit, although interpreting the data would still be difficult and the
> tracepoint would have to include information about the cgroup. While
> I've never tried, this seems like the type of thing that would be suited
> to a BPF script that probes task_numa_fault and extract the information
> it needs.

> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-17 13:23       ` 王贇
@ 2020-02-17 14:16         ` Mel Gorman
  2020-02-18  1:39           ` 王贇
  2020-02-21 15:28         ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2020-02-17 14:16 UTC (permalink / raw)
  To: ??????
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

On Mon, Feb 17, 2020 at 09:23:52PM +0800, ?????? wrote:
> 
> 
> On 2020/2/17 ??????7:58, Mel Gorman wrote:
> [snip]
> >> Mel, I suspect you still feel that way, right?
> >>
> > 
> > Yes, I still think it would be a struggle to interpret the data
> > meaningfully without very specific knowledge of the implementation. If
> > the scan rate was constant, it would be easier but that would make NUMA
> > balancing worse overall. Similarly, the stat might get very difficult to
> > interpret when NUMA balancing is failing because of a load imbalance,
> > pages are shared and being interleaved or NUMA groups span multiple
> > active nodes.
> 
> Hi, Mel, appreciated to have you back on the table :-)
> 
> IMHO the scan period changing should not be a problem now, since the
> maximum period is defined by user, so monitoring at maximum period
> on the accumulated page accessing counters is always meaningful, correct?
> 

It has meaning but the scan rate drives the fault rate which is the basis
for the stats you accumulate. If the scan rate is high when accesses
are local, the stats can be skewed making it appear the task is much
more local than it may really is at a later point in time. The scan rate
affects the accuracy of the information. The counters have meaning but
they needs careful interpretation.

> FYI, by monitoring locality, we found that the kvm vcpu thread is not
> covered by NUMA Balancing, whatever how many maximum period passed, the
> counters are not increasing, or very slowly, although inside guest we are
> copying memory.
> 
> Later we found such task rarely exit to user space to trigger task
> work callbacks, and NUMA Balancing scan depends on that, which help us
> realize the importance to enable NUMA Balancing inside guest, with the
> correct NUMA topo, a big performance risk I'll say :-P
> 

Which is a very interesting corner case in itself but also one that
could have potentially have been inferred from monitoring /proc/vmstat
numa_pte_updates or on a per-task basis by monitoring /proc/PID/sched and
watching numa_scan_seq and total_numa_faults. Accumulating the information
on a per-cgroup basis would require a bit more legwork.

> Maybe not a good example, but we just try to highlight that NUMA Balancing
> could have issue in some cases, and we want them to be exposed, somehow,
> maybe by the locality.
> 

Again, I'm somewhat neutral on the patch simply because I would not use
the information for debugging problems with NUMA balancing. I would try
using tracepoints and if the tracepoints were not good enough, I'd add or
fix them -- similar to what I had to do with sched_stick_numa recently.
The caveat is that I mostly look at this sort of problem as a developer.
Sysadmins have very different requirements, especially simplicity even
if the simplicity in this case is an illusion.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-17 14:16         ` Mel Gorman
@ 2020-02-18  1:39           ` 王贇
  2020-02-21 14:20             ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: 王贇 @ 2020-02-18  1:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet



On 2020/2/17 下午10:16, Mel Gorman wrote:
> On Mon, Feb 17, 2020 at 09:23:52PM +0800, ?????? wrote:
[snip]
>>
>> IMHO the scan period changing should not be a problem now, since the
>> maximum period is defined by user, so monitoring at maximum period
>> on the accumulated page accessing counters is always meaningful, correct?
>>
> 
> It has meaning but the scan rate drives the fault rate which is the basis
> for the stats you accumulate. If the scan rate is high when accesses
> are local, the stats can be skewed making it appear the task is much
> more local than it may really is at a later point in time. The scan rate
> affects the accuracy of the information. The counters have meaning but
> they needs careful interpretation.

Yeah, to zip so many information from NUMA Balancing to some statistics
is a challenge itself, the locality still not so easy to be understood by
NUMA newbie :-P

> 
>> FYI, by monitoring locality, we found that the kvm vcpu thread is not
>> covered by NUMA Balancing, whatever how many maximum period passed, the
>> counters are not increasing, or very slowly, although inside guest we are
>> copying memory.
>>
>> Later we found such task rarely exit to user space to trigger task
>> work callbacks, and NUMA Balancing scan depends on that, which help us
>> realize the importance to enable NUMA Balancing inside guest, with the
>> correct NUMA topo, a big performance risk I'll say :-P
>>
> 
> Which is a very interesting corner case in itself but also one that
> could have potentially have been inferred from monitoring /proc/vmstat
> numa_pte_updates or on a per-task basis by monitoring /proc/PID/sched and
> watching numa_scan_seq and total_numa_faults. Accumulating the information
> on a per-cgroup basis would require a bit more legwork.

That's not working for daily monitoring...

Besides, compared with locality, this require much more deeper understand
on the implementation, which could even be tough for NUMA developers to
assemble all these statistics together.

> 
>> Maybe not a good example, but we just try to highlight that NUMA Balancing
>> could have issue in some cases, and we want them to be exposed, somehow,
>> maybe by the locality.
>>
> 
> Again, I'm somewhat neutral on the patch simply because I would not use
> the information for debugging problems with NUMA balancing. I would try
> using tracepoints and if the tracepoints were not good enough, I'd add or
> fix them -- similar to what I had to do with sched_stick_numa recently.
> The caveat is that I mostly look at this sort of problem as a developer.
> Sysadmins have very different requirements, especially simplicity even
> if the simplicity in this case is an illusion.

Fair enough, but I guess PeterZ still want your Ack, so neutral means
refuse in this case :-(

BTW, how do you think about the documentation in second patch?

Do you think it's necessary to have a doc to explain NUMA related statistics?

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-18  1:39           ` 王贇
@ 2020-02-21 14:20             ` Mel Gorman
  2020-02-21 15:47               ` Peter Zijlstra
  2020-02-24  3:05               ` 王贇
  0 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2020-02-21 14:20 UTC (permalink / raw)
  To: ??????
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

On Tue, Feb 18, 2020 at 09:39:35AM +0800, ?????? wrote:
> On 2020/2/17 ??????10:16, Mel Gorman wrote:
> > On Mon, Feb 17, 2020 at 09:23:52PM +0800, ?????? wrote:
> [snip]
> >>
> >> IMHO the scan period changing should not be a problem now, since the
> >> maximum period is defined by user, so monitoring at maximum period
> >> on the accumulated page accessing counters is always meaningful, correct?
> >>
> > 
> > It has meaning but the scan rate drives the fault rate which is the basis
> > for the stats you accumulate. If the scan rate is high when accesses
> > are local, the stats can be skewed making it appear the task is much
> > more local than it may really is at a later point in time. The scan rate
> > affects the accuracy of the information. The counters have meaning but
> > they needs careful interpretation.
> 
> Yeah, to zip so many information from NUMA Balancing to some statistics
> is a challenge itself, the locality still not so easy to be understood by
> NUMA newbie :-P
> 

Indeed and if they do not take into account historical skew into
account, they still might not understand.

> > 
> >> FYI, by monitoring locality, we found that the kvm vcpu thread is not
> >> covered by NUMA Balancing, whatever how many maximum period passed, the
> >> counters are not increasing, or very slowly, although inside guest we are
> >> copying memory.
> >>
> >> Later we found such task rarely exit to user space to trigger task
> >> work callbacks, and NUMA Balancing scan depends on that, which help us
> >> realize the importance to enable NUMA Balancing inside guest, with the
> >> correct NUMA topo, a big performance risk I'll say :-P
> >>
> > 
> > Which is a very interesting corner case in itself but also one that
> > could have potentially have been inferred from monitoring /proc/vmstat
> > numa_pte_updates or on a per-task basis by monitoring /proc/PID/sched and
> > watching numa_scan_seq and total_numa_faults. Accumulating the information
> > on a per-cgroup basis would require a bit more legwork.
> 
> That's not working for daily monitoring...
> 

Indeed although at least /proc/vmstat is cheap to monitor and it could
at least be tracked if the number of NUMA faults are abnormally low or
the ratio of remote to local hints are problematic.

> Besides, compared with locality, this require much more deeper understand
> on the implementation, which could even be tough for NUMA developers to
> assemble all these statistics together.
> 

My point is that even with the patch, the definition of locality is
subtle. At a single point in time, the locality might appear to be low
but it's due to an event that happened far in the past.

> > 
> >> Maybe not a good example, but we just try to highlight that NUMA Balancing
> >> could have issue in some cases, and we want them to be exposed, somehow,
> >> maybe by the locality.
> >>
> > 
> > Again, I'm somewhat neutral on the patch simply because I would not use
> > the information for debugging problems with NUMA balancing. I would try
> > using tracepoints and if the tracepoints were not good enough, I'd add or
> > fix them -- similar to what I had to do with sched_stick_numa recently.
> > The caveat is that I mostly look at this sort of problem as a developer.
> > Sysadmins have very different requirements, especially simplicity even
> > if the simplicity in this case is an illusion.
> 
> Fair enough, but I guess PeterZ still want your Ack, so neutral means
> refuse in this case :-(
> 

I think the patch is functionally harmless and can be disabled but I also
would be wary of dealing with a bug report that was based on the numbers
provided by the locality metric. The bulk of the work related to the bug
would likely be spent on trying to explain the metric and I've dealt with
quite a few bugs that were essentially "We don't like this number and think
something is wrong because of it -- fix it". Even then, I would want the
workload isolated and then vmstat recorded over time to determine it's
a persistent problem or not. That's the reason why I'm relucant to ack it.

I fully acknowledge that this may have value for sysadmins and may be a
good enough reason to merge it for environments that typically build and
configure their own kernels. I doubt that general distributions would
enable it but that's a guess.

> BTW, how do you think about the documentation in second patch?
> 

I think the documentation is great, it's clear and explains itself well.

> Do you think it's necessary to have a doc to explain NUMA related statistics?
> 

It would be nice but AFAIK, the stats in vmstats are not documented.
They are there because recording them over time can be very useful when
dealing with user bug reports.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-17 13:23       ` 王贇
  2020-02-17 14:16         ` Mel Gorman
@ 2020-02-21 15:28         ` Peter Zijlstra
  2020-02-24  3:09           ` 王贇
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2020-02-21 15:28 UTC (permalink / raw)
  To: 王贇
  Cc: Mel Gorman, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

On Mon, Feb 17, 2020 at 09:23:52PM +0800, 王贇 wrote:
> FYI, by monitoring locality, we found that the kvm vcpu thread is not
> covered by NUMA Balancing, whatever how many maximum period passed, the
> counters are not increasing, or very slowly, although inside guest we are
> copying memory.
> 
> Later we found such task rarely exit to user space to trigger task
> work callbacks, and NUMA Balancing scan depends on that, which help us
> realize the importance to enable NUMA Balancing inside guest, with the
> correct NUMA topo, a big performance risk I'll say :-P

That's a bug in KVM, see:

  https://lkml.kernel.org/r/20190801143657.785902257@linutronix.de
  https://lkml.kernel.org/r/20190801143657.887648487@linutronix.de

ISTR there being newer versions of that patch-set, but I can't seem to
find them in a hurry.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-21 14:20             ` Mel Gorman
@ 2020-02-21 15:47               ` Peter Zijlstra
  2020-02-24  3:13                 ` 王贇
  2020-02-24  3:05               ` 王贇
  1 sibling, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2020-02-21 15:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: ??????,
	Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet

On Fri, Feb 21, 2020 at 02:20:10PM +0000, Mel Gorman wrote:
> I fully acknowledge that this may have value for sysadmins and may be a
> good enough reason to merge it for environments that typically build and
> configure their own kernels. I doubt that general distributions would
> enable it but that's a guess.

OTOH, many sysadmins seem to 'rely' on BPF scripts and other such fancy
things these days.

 ( of course, we have the open question on what happens when we break
   one of those BPF 'important' scripts ... )

My main reservation with this patch is that it exposes, to userspace, an
ABI that is very hard to interpret and subject to implementation
details.

So while it can be disabled; people who have it enabled might suddenly
complain when we change the meaning/interpretation/whatever of these
magic numbers.

Michael; you seem to have ignored the tracepoint / BPF angle earlier in
this discussion; that is not something that could/would work for you?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-21 14:20             ` Mel Gorman
  2020-02-21 15:47               ` Peter Zijlstra
@ 2020-02-24  3:05               ` 王贇
  1 sibling, 0 replies; 16+ messages in thread
From: 王贇 @ 2020-02-24  3:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet



On 2020/2/21 下午10:20, Mel Gorman wrote:
[snip]
>>>
>>> Which is a very interesting corner case in itself but also one that
>>> could have potentially have been inferred from monitoring /proc/vmstat
>>> numa_pte_updates or on a per-task basis by monitoring /proc/PID/sched and
>>> watching numa_scan_seq and total_numa_faults. Accumulating the information
>>> on a per-cgroup basis would require a bit more legwork.
>>
>> That's not working for daily monitoring...
>>
> 
> Indeed although at least /proc/vmstat is cheap to monitor and it could
> at least be tracked if the number of NUMA faults are abnormally low or
> the ratio of remote to local hints are problematic.
> 
>> Besides, compared with locality, this require much more deeper understand
>> on the implementation, which could even be tough for NUMA developers to
>> assemble all these statistics together.
>>
> 
> My point is that even with the patch, the definition of locality is
> subtle. At a single point in time, the locality might appear to be low
> but it's due to an event that happened far in the past.

Agree, the locality's meaning just keep changing... only those who
understand the implementation can figure out the useful information.

> 
>>>
>>>> Maybe not a good example, but we just try to highlight that NUMA Balancing
>>>> could have issue in some cases, and we want them to be exposed, somehow,
>>>> maybe by the locality.
>>>>
>>>
>>> Again, I'm somewhat neutral on the patch simply because I would not use
>>> the information for debugging problems with NUMA balancing. I would try
>>> using tracepoints and if the tracepoints were not good enough, I'd add or
>>> fix them -- similar to what I had to do with sched_stick_numa recently.
>>> The caveat is that I mostly look at this sort of problem as a developer.
>>> Sysadmins have very different requirements, especially simplicity even
>>> if the simplicity in this case is an illusion.
>>
>> Fair enough, but I guess PeterZ still want your Ack, so neutral means
>> refuse in this case :-(
>>
> 
> I think the patch is functionally harmless and can be disabled but I also
> would be wary of dealing with a bug report that was based on the numbers
> provided by the locality metric. The bulk of the work related to the bug
> would likely be spent on trying to explain the metric and I've dealt with
> quite a few bugs that were essentially "We don't like this number and think
> something is wrong because of it -- fix it". Even then, I would want the
> workload isolated and then vmstat recorded over time to determine it's
> a persistent problem or not. That's the reason why I'm relucant to ack it.
> 
> I fully acknowledge that this may have value for sysadmins and may be a
> good enough reason to merge it for environments that typically build and
> configure their own kernels. I doubt that general distributions would
> enable it but that's a guess.

Thanks for the kindly explain, I get the point.

False alarm maybe fine to admin, but could be nightmare if the user keep
asking why, I suppose those who want to do some improvement on NUMA may be
interested :-P

Anyway, I understand there is a gap between general requirement and this
locality idea, and it's really hard to be fulfill...

> 
>> BTW, how do you think about the documentation in second patch?
>>
> 
> I think the documentation is great, it's clear and explains itself well.
> 
>> Do you think it's necessary to have a doc to explain NUMA related statistics?
>>
> 
> It would be nice but AFAIK, the stats in vmstats are not documented.
> They are there because recording them over time can be very useful when
> dealing with user bug reports.

Another TODO then :-)

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-21 15:28         ` Peter Zijlstra
@ 2020-02-24  3:09           ` 王贇
  0 siblings, 0 replies; 16+ messages in thread
From: 王贇 @ 2020-02-24  3:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Luis Chamberlain,
	Kees Cook, Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet



On 2020/2/21 下午11:28, Peter Zijlstra wrote:
> On Mon, Feb 17, 2020 at 09:23:52PM +0800, 王贇 wrote:
>> FYI, by monitoring locality, we found that the kvm vcpu thread is not
>> covered by NUMA Balancing, whatever how many maximum period passed, the
>> counters are not increasing, or very slowly, although inside guest we are
>> copying memory.
>>
>> Later we found such task rarely exit to user space to trigger task
>> work callbacks, and NUMA Balancing scan depends on that, which help us
>> realize the importance to enable NUMA Balancing inside guest, with the
>> correct NUMA topo, a big performance risk I'll say :-P
> 
> That's a bug in KVM, see:
> 
>   https://lkml.kernel.org/r/20190801143657.785902257@linutronix.de
>   https://lkml.kernel.org/r/20190801143657.887648487@linutronix.de
> 
> ISTR there being newer versions of that patch-set, but I can't seem to
> find them in a hurry.

Aha, that's exactly the problem we saw, will check~

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
  2020-02-21 15:47               ` Peter Zijlstra
@ 2020-02-24  3:13                 ` 王贇
  0 siblings, 0 replies; 16+ messages in thread
From: 王贇 @ 2020-02-24  3:13 UTC (permalink / raw)
  To: Peter Zijlstra, Mel Gorman
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Luis Chamberlain, Kees Cook,
	Iurii Zaikin, Michal Koutn?,
	linux-fsdevel, linux-kernel, linux-doc, Paul E. McKenney,
	Randy Dunlap, Jonathan Corbet



On 2020/2/21 下午11:47, Peter Zijlstra wrote:
> On Fri, Feb 21, 2020 at 02:20:10PM +0000, Mel Gorman wrote:
>> I fully acknowledge that this may have value for sysadmins and may be a
>> good enough reason to merge it for environments that typically build and
>> configure their own kernels. I doubt that general distributions would
>> enable it but that's a guess.
> 
> OTOH, many sysadmins seem to 'rely' on BPF scripts and other such fancy
> things these days.
> 
>  ( of course, we have the open question on what happens when we break
>    one of those BPF 'important' scripts ... )
> 
> My main reservation with this patch is that it exposes, to userspace, an
> ABI that is very hard to interpret and subject to implementation
> details.
> 
> So while it can be disabled; people who have it enabled might suddenly
> complain when we change the meaning/interpretation/whatever of these
> magic numbers.
> 
> Michael; you seem to have ignored the tracepoint / BPF angle earlier in
> this discussion; that is not something that could/would work for you?

At very beginning I think these fancy stuff may consume too much resources
them selves, so just as you said, ignored the possibility :-P

But now I understand there is a big gap here, which require a much more general
way to evaluate the NUMA platform, I'll try to follow this way see if there
are any practical approach instead~

Regards,
Michael Wang

> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-02-24  3:14 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-07  3:34 [PATCH RESEND v8 0/2] sched/numa: introduce numa locality 王贇
2020-02-07  3:35 ` [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info 王贇
2020-02-07  3:37   ` 王贇
2020-02-13  2:35     ` 王贇
2020-02-14 15:10   ` Peter Zijlstra
2020-02-17 11:58     ` Mel Gorman
2020-02-17 13:23       ` 王贇
2020-02-17 14:16         ` Mel Gorman
2020-02-18  1:39           ` 王贇
2020-02-21 14:20             ` Mel Gorman
2020-02-21 15:47               ` Peter Zijlstra
2020-02-24  3:13                 ` 王贇
2020-02-24  3:05               ` 王贇
2020-02-21 15:28         ` Peter Zijlstra
2020-02-24  3:09           ` 王贇
2020-02-07  3:35 ` [PATCH RESEND v8 2/2] sched/numa: documentation for per-cgroup numa 王贇

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).