From: Randy Dunlap <rdunlap@infradead.org>
To: 王贇 <yun.wang@linux.alibaba.com>, "Ingo Molnar" <mingo@redhat.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Luis Chamberlain" <mcgrof@kernel.org>,
	"Kees Cook" <keescook@chromium.org>,
	"Iurii Zaikin" <yzaikin@google.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org,
	"Paul E. McKenney" <paulmck@linux.ibm.com>,
	"Jonathan Corbet" <corbet@lwn.net>
Subject: Re: [PATCH v8 2/2] sched/numa: documentation for per-cgroup numa, statistics
Date: Mon, 20 Jan 2020 18:08:57 -0800	[thread overview]
Message-ID: <5c299da6-762e-1d2d-dafb-cfe3a0082d56@infradead.org> (raw)
In-Reply-To: <23fc0493-967c-d0e1-767b-89e8f7c85718@linux.alibaba.com>

On 1/20/20 5:57 PM, 王贇 wrote:
> Add the description for 'numa_locality', also a new doc to explain
> the details on how to deal with the per-cgroup numa statistics.
> 
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Michal Koutný <mkoutny@suse.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Iurii Zaikin <yzaikin@google.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>

Thanks for the updates.
Acked-by: Randy Dunlap <rdunlap@infradead.org>


> ---
>  Documentation/admin-guide/cg-numa-stat.rst      | 178 ++++++++++++++++++++++++
>  Documentation/admin-guide/index.rst             |   1 +
>  Documentation/admin-guide/kernel-parameters.txt |   4 +
>  Documentation/admin-guide/sysctl/kernel.rst     |   9 ++
>  init/Kconfig                                    |   2 +
>  5 files changed, 194 insertions(+)
>  create mode 100644 Documentation/admin-guide/cg-numa-stat.rst
> 
> diff --git a/Documentation/admin-guide/cg-numa-stat.rst b/Documentation/admin-guide/cg-numa-stat.rst
> new file mode 100644
> index 000000000000..1106eb1e4050
> --- /dev/null
> +++ b/Documentation/admin-guide/cg-numa-stat.rst
> @@ -0,0 +1,178 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============================
> +Per-cgroup NUMA statistics
> +===============================
> +
> +Background
> +----------
> +
> +On NUMA platforms, remote memory accesses always incur a performance penalty.
> +Although we have NUMA Balancing working hard to maximize access locality,
> +there are still situations where it can't help.
> +
> +This can happen in modern production environments. When a large number of
> +cgroups are used to classify and control resources, the result is a complex
> +configuration of memory policies, CPUs and NUMA nodes. In such cases NUMA
> +Balancing could end up with the wrong memory policy or an exhausted local
> +NUMA node, leading to a low percentage of local page accesses.
> +
> +We need to detect such cases and figure out which workloads from which
> +cgroups have introduced the issues; only then do we get a chance to make
> +adjustments and avoid performance degradation.
> +
> +However, there are no hardware counters for per-task local/remote access
> +info, so we don't know how many remote page accesses have occurred for a
> +particular task.
> +
> +NUMA Locality
> +-------------
> +
> +Fortunately, we have NUMA Balancing, which scans a task's mappings and
> +triggers page faults periodically, giving us the opportunity to record
> +per-task page access info. When the CPU handling the page fault is on the
> +same node as the page, we count the access as local; otherwise we count it
> +as remote. We call these two counters the locality info.
> +
> +On each tick, we read the locality info of the current task on that CPU and
> +add the increments into its cgroup, forming the group's locality info.
> +
> +Running "echo 1 > /proc/sys/kernel/numa_locality" at runtime, or adding the
> +boot parameter 'numa_locality', enables the accounting of per-cgroup NUMA
> +locality info. The 'cpu.numa_stat' entry of the CPU cgroup will then show
> +statistics::
> +
> +  page_access local=NR_LOCAL_PAGE_ACCESS remote=NR_REMOTE_PAGE_ACCESS
> +
> +We define 'NUMA locality' as::
> +
> +  NR_LOCAL_PAGE_ACCESS * 100 / (NR_LOCAL_PAGE_ACCESS + NR_REMOTE_PAGE_ACCESS)
> +
> +This per-cgroup percentage helps to characterize NUMA Balancing behavior.
> +
> +Note that the accounting is hierarchical, which means the NUMA locality info for
> +a given group represents not only the workload of this group, but also the
> +workloads of all its descendants.
> +
> +For example, suppose 'cpu.numa_stat' shows::
> +
> +  page_access local=129909383 remote=18265810
> +
> +Then the NUMA locality is calculated as::
> +
> +  129909383 * 100 / (129909383 + 18265810) = 87.67
> +
> +Thus we know the workloads of this group and its descendants have done
> +129909383 local page accesses and 18265810 remote ones in total; the locality
> +is 87.67%, which implies that most of the memory accesses are local.
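> +
> +As a minimal sketch (in Python, with a hypothetical cgroup path that should
> +be adjusted to your own hierarchy), the counters can be read and the
> +locality computed as::
> +
> +  import re
> +
> +  def read_locality(numa_stat_path):
> +      """Parse 'cpu.numa_stat' and return the locality percentage."""
> +      with open(numa_stat_path) as f:
> +          text = f.read()
> +      # Expected format: "page_access local=NR_LOCAL remote=NR_REMOTE"
> +      local = int(re.search(r'local=(\d+)', text).group(1))
> +      remote = int(re.search(r'remote=(\d+)', text).group(1))
> +      if local + remote == 0:
> +          return None   # no samples recorded yet
> +      return local * 100.0 / (local + remote)
> +
> +  # Hypothetical cgroup v1 path; adjust to your hierarchy.
> +  print(read_locality('/sys/fs/cgroup/cpu/mygroup/cpu.numa_stat'))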
> +
> +NUMA Consumption
> +----------------
> +
> +There are also other cgroup entries which help us to estimate NUMA efficiency.
> +They are 'cpuacct.usage_percpu' and 'memory.numa_stat'.
> +
> +By reading 'cpuacct.usage_percpu' we get the hierarchical per-CPU runtime
> +(in nanoseconds) as::
> +
> +  CPU_0_RUNTIME CPU_1_RUNTIME CPU_2_RUNTIME ... CPU_X_RUNTIME
> +
> +Combined with the info from::
> +
> +  cat /sys/devices/system/node/nodeX/cpulist
> +
> +We can accumulate the runtime of these CPUs into NUMA nodes, to get the
> +per-cgroup per-node runtime info.
> +
> +By reading 'memory.numa_stat' we get the per-cgroup per-node memory
> +consumption info as::
> +
> +  total=TOTAL_MEM N0=MEM_ON_NODE0 N1=MEM_ON_NODE1 ... NX=MEM_ON_NODEX
> +
> +Together we call these the per-cgroup NUMA consumption info, telling us how
> +many resources a particular workload has consumed on a particular NUMA node.
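> +
> +A sketch of how this accumulation might look (in Python, with hypothetical
> +cgroup paths that should be adjusted to your own hierarchy)::
> +
> +  import glob, os, re
> +
> +  def parse_cpulist(s):
> +      """Expand a cpulist string like '0-3,8-11' into a set of CPU ids."""
> +      cpus = set()
> +      for part in s.strip().split(','):
> +          if not part:
> +              continue
> +          lo, _, hi = part.partition('-')
> +          cpus.update(range(int(lo), int(hi or lo) + 1))
> +      return cpus
> +
> +  def per_node_runtime(usage_percpu_path):
> +      """Accumulate per-CPU runtime (ns) into per-NUMA-node totals."""
> +      runtimes = [int(x) for x in open(usage_percpu_path).read().split()]
> +      totals = {}
> +      for node_dir in glob.glob('/sys/devices/system/node/node[0-9]*'):
> +          node = int(os.path.basename(node_dir)[4:])
> +          cpus = parse_cpulist(open(node_dir + '/cpulist').read())
> +          totals[node] = sum(runtimes[c] for c in cpus if c < len(runtimes))
> +      return totals
> +
> +  def per_node_memory(numa_stat_path):
> +      """Parse the per-node counters from the first 'memory.numa_stat' line."""
> +      first = open(numa_stat_path).readline()
> +      return {int(n): int(v) for n, v in re.findall(r'N(\d+)=(\d+)', first)}
> +
> +  # Hypothetical cgroup v1 paths; adjust to your hierarchy.
> +  base = '/sys/fs/cgroup'
> +  print(per_node_runtime(base + '/cpuacct/mygroup/cpuacct.usage_percpu'))
> +  print(per_node_memory(base + '/memory/mygroup/memory.numa_stat'))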
> +
> +Monitoring
> +----------
> +
> +By monitoring the increments of the locality info, we can easily tell
> +whether NUMA Balancing is working well for a particular workload.
> +
> +For example, with a 5-second sample period, on each sample we have::
> +
> +  local_diff = nr_local_page_access - last_nr_local_page_access
> +  remote_diff = nr_remote_page_access - last_nr_remote_page_access
> +
> +and we get the locality in this period as::
> +
> +  locality = local_diff * 100 / (local_diff + remote_diff)
> +
> +We can plot the locality as a line. When the line is close to 100%, things
> +are good; when it gets close to 0%, something is wrong. We can pick a proper
> +watermark to trigger a warning message.
> +
> +You may want to drop the data if local_diff/remote_diff is too small, which
> +implies there are not many pages available for NUMA Balancing to scan.
> +Ignoring such samples is fine, since most likely the workload is insensitive
> +to NUMA or the memory topology is already good enough.
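> +
> +A minimal monitoring loop along these lines (a sketch in Python; the cgroup
> +path, watermark and minimum-diff threshold below are assumptions to adjust
> +for your environment)::
> +
> +  import re, time
> +
> +  NUMA_STAT = '/sys/fs/cgroup/cpu/mygroup/cpu.numa_stat'  # hypothetical path
> +  PERIOD = 5        # seconds, matching the example above
> +  MIN_DIFF = 1000   # drop samples with too few page accesses
> +  WATERMARK = 50.0  # warn when locality falls below this percentage
> +
> +  def read_counters(path):
> +      text = open(path).read()
> +      local = int(re.search(r'local=(\d+)', text).group(1))
> +      remote = int(re.search(r'remote=(\d+)', text).group(1))
> +      return local, remote
> +
> +  last_local, last_remote = read_counters(NUMA_STAT)
> +  while True:
> +      time.sleep(PERIOD)
> +      local, remote = read_counters(NUMA_STAT)
> +      local_diff = local - last_local
> +      remote_diff = remote - last_remote
> +      last_local, last_remote = local, remote
> +      if local_diff + remote_diff < MIN_DIFF:
> +          continue  # too few accesses in this period, ignore the sample
> +      locality = local_diff * 100.0 / (local_diff + remote_diff)
> +      if locality < WATERMARK:
> +          print('warning: locality %.2f%% below watermark' % locality)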
> +
> +Monitoring the root group gives an overview of the whole system, while you
> +may also want to monitor all the leaf groups that contain the workloads;
> +this helps to pinpoint the culprit.
> +
> +Try to put your workload into the cpuacct & memory cgroups as well. When
> +NUMA Balancing is disabled, or the locality becomes too small, we may want
> +to monitor the per-node runtime & memory info to see whether the node
> +consumption meets the requirements.
> +
> +For NUMA node X, on each sample we have::
> +
> +  runtime_X_diff = runtime_X - last_runtime_X
> +  runtime_all_diff = runtime_all - last_runtime_all
> +
> +  runtime_percent_X = runtime_X_diff * 100 / runtime_all_diff
> +  memory_percent_X = memory_X * 100 / memory_all
> +
> +These two percentages usually match on each node: a workload should execute
> +mostly on the node that contains most of its memory, but this is not
> +guaranteed.
> +
> +The workload may only access a small part of its memory; in such cases,
> +although the majority of the memory is remote, the locality could still be
> +good.
> +
> +Thus, telling whether things are fine depends on an understanding of the
> +system's resource deployment; however, if you find that node X has 100% of
> +the memory but 0% of the runtime, something is definitely wrong.
> +
> +Troubleshooting
> +---------------
> +
> +After identifying which workload introduced the bad locality, check:
> +
> +1). Is the workload bound to a particular NUMA node?
> +2). Has any NUMA node run out of resources?
> +
> +There are several ways to bind a task's memory to a NUMA node. Strict
> +approaches, such as the MPOL_BIND memory policy or 'cpuset.mems', limit the
> +nodes from which pages may be allocated. In this situation, the admin should
> +make sure the task is allowed to run on the CPUs of that NUMA node, and that
> +there are available CPU resources there.
> +
> +There are also ways to bind a task's CPUs to a NUMA node, such as
> +'cpuset.cpus' or the sched_setaffinity() syscall. In this situation, NUMA
> +Balancing helps to migrate pages into that node; the admin should make sure
> +there is available memory there.
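> +
> +For instance, a task's CPUs can be restricted to the CPUs of a node with the
> +sched_setaffinity() syscall; a sketch in Python (the PID below is
> +hypothetical)::
> +
> +  import os
> +
> +  def parse_cpulist(s):
> +      """Expand a cpulist string like '0-3,8-11' into a set of CPU ids."""
> +      cpus = set()
> +      for part in s.strip().split(','):
> +          if not part:
> +              continue
> +          lo, _, hi = part.partition('-')
> +          cpus.update(range(int(lo), int(hi or lo) + 1))
> +      return cpus
> +
> +  # Pin a task (hypothetical PID) to the CPUs of NUMA node 0, so NUMA
> +  # Balancing can work on migrating its pages toward that node.
> +  cpulist = open('/sys/devices/system/node/node0/cpulist').read()
> +  os.sched_setaffinity(12345, parse_cpulist(cpulist))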
> +
> +The admin can try to rebind or unbind the NUMA node to repair the damage:
> +make a change, then observe the statistics to see whether things get better,
> +repeating until the situation is acceptable.
> +
> +Highlights
> +----------
> +
> +For some tasks, NUMA Balancing may find it unnecessary to scan their pages,
> +and the locality could always be 0 or a small number; don't pay attention to
> +them, since they are most likely insensitive to NUMA.
> +
> +There is no accounting until the option is turned on, so enable it in advance
> +if you want to have the whole history.
> +
> +We have a per-task 'migfailed' counter to tell how many page migrations have
> +failed for a particular task; you will find it in the /proc/PID/sched entry.

