Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info

From: Peter Zijlstra <peterz@infradead.org>
To: 王贇 <yun.wang@linux.alibaba.com>
Cc: "Ingo Molnar" <mingo@redhat.com>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	"Kees Cook" <keescook@chromium.org>,
	"Iurii Zaikin" <yzaikin@google.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org,
	"Paul E. McKenney" <paulmck@linux.ibm.com>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Jonathan Corbet" <corbet@lwn.net>
Subject: Re: [PATCH RESEND v8 1/2] sched/numa: introduce per-cgroup NUMA locality info
Date: Fri, 14 Feb 2020 16:10:48 +0100
Message-ID: <20200214151048.GL14914@hirez.programming.kicks-ass.net>
In-Reply-To: <cde13472-46c0-7e17-175f-4b2ba4d8148a@linux.alibaba.com>

On Fri, Feb 07, 2020 at 11:35:30AM +0800, 王贇 wrote:
> Currently there is no good approach to monitoring per-cgroup NUMA
> efficiency. This can be a problem, especially when groups share
> CPUs: we don't know which one introduced the remote-memory accesses.
> 
> Although the per-task NUMA access info from the PMU is good for further
> debugging, it is not light enough for daily monitoring, especially on a
> box with thousands of tasks.
> 
> Fortunately, when NUMA Balancing is enabled, it periodically triggers
> page faults and tries to increase NUMA locality; by tracing the results
> we can estimate the NUMA efficiency.
> 
> On each NUMA Balancing page fault, when the task's executing CPU is on
> the same node as the page, we call this a local page access, otherwise
> a remote page access.
> 
> By folding each task's access counters into its cgroup on ticks, we get
> the per-cgroup NUMA locality info.
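
To make the accounting described above concrete, here is a minimal
userspace sketch. It is not code from the patch: the struct and function
names (group_stat, task_stat, account_fault, flush_on_tick) are
hypothetical, and folding the counters into every ancestor group at flush
time is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a cgroup's counters; not the patch's layout. */
struct group_stat {
	unsigned long local;
	unsigned long remote;
	struct group_stat *parent;	/* NULL for the root group */
};

/* Hypothetical per-task counters, accumulated between ticks. */
struct task_stat {
	unsigned long local;
	unsigned long remote;
	struct group_stat *group;
};

/* On each NUMA Balancing fault: local if the CPU's node matches the page's. */
static void account_fault(struct task_stat *t, int cpu_node, int page_node)
{
	assert(t != NULL);
	if (cpu_node == page_node)
		t->local++;
	else
		t->remote++;
}

/*
 * On tick: fold the task's counters into its group and (by assumption)
 * all ancestors, then reset the per-task counters.
 */
static void flush_on_tick(struct task_stat *t)
{
	struct group_stat *g;

	assert(t != NULL);
	for (g = t->group; g; g = g->parent) {
		g->local += t->local;
		g->remote += t->remote;
	}
	t->local = 0;
	t->remote = 0;
}
```

Calling account_fault() once per fault and flush_on_tick() once per tick
keeps the fault path down to a single increment; the group counters are
only touched at tick granularity.
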
> 
> For example, the new entry 'cpu.numa_stat' shows:
>   page_access local=1231412 remote=53453
> 
> Here we know the workloads in the hierarchy have been traced for a total
> of 1284865 page accesses, and 1231412 of them are local page accesses,
> which implies good NUMA efficiency.
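
For the numbers quoted above, the local share works out to
1231412 / 1284865, or roughly 95.8%. A small sketch of how a monitoring
tool might derive that from one line of the file, assuming the line
format is exactly the one shown in the example; the helper name
numa_local_percent is hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a "page_access local=N remote=M" line and return the local
 * share in percent. Returns -1.0 if the line does not match or if no
 * accesses were traced at all.
 */
static double numa_local_percent(const char *line)
{
	unsigned long long local, remote;

	assert(line != NULL);
	if (sscanf(line, " page_access local=%llu remote=%llu",
		   &local, &remote) != 2)
		return -1.0;
	if (local + remote == 0)
		return -1.0;
	return 100.0 * (double)local / (double)(local + remote);
}
```

Since the counters only ever grow, a monitor would more usefully apply
the same ratio to the deltas between two reads than to the absolute
values.
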
> 
> By monitoring the increments, we will be able to locate the per-cgroup
> workloads that NUMA Balancing can't help with (usually caused by wrong
> CPU and memory node bindings), so we get a chance to fix that in time.
> 
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>

So here:

  https://lkml.kernel.org/r/20191127101932.GN28938@suse.de

Mel argues that the information exposed is fairly implementation
specific and hard to use without understanding how NUMA balancing works.

By exposing it to userspace, we tie ourselves to these particulars. We
can no longer change these NUMA balancing details if we wanted to, due
to UAPI concerns.

Mel, I suspect you still feel that way, right?

In the document (patch 2/2) you write:

> +However, there are no hardware counters for per-task local/remote accessing
> +info, we don't know how many remote page accesses have occurred for a
> +particular task.

We can of course 'fix' that by adding a tracepoint.

Mel, would you feel better about having a tracepoint in task_numa_fault()?

Now I'm not really a fan of tracepoints myself, since they also
establish a UAPI, but perhaps it is a lesser evil in this case.