From: 王贇 <yun.wang@linux.alibaba.com>
To: Peter Zijlstra <peterz@infradead.org>,
	hannes@cmpxchg.org, mhocko@kernel.org, vdavydov.dev@gmail.com,
	Ingo Molnar <mingo@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mcgrof@kernel.org, keescook@chromium.org,
	linux-fsdevel@vger.kernel.org, cgroups@vger.kernel.org,
	"Michal Koutný" <mkoutny@suse.com>,
	"Hillf Danton" <hdanton@sina.com>
Subject: [PATCH v2 0/4] per-cgroup numa suite
Date: Tue, 16 Jul 2019 11:38:47 +0800	[thread overview]
Message-ID: <65c1987f-bcce-2165-8c30-cf8cf3454591@linux.alibaba.com> (raw)
In-Reply-To: <60b59306-5e36-e587-9145-e90657daec41@linux.alibaba.com>

While torture testing NUMA workloads, we found problems like:

  * missing per-cgroup information about per-node execution status
  * missing per-cgroup information about NUMA locality

That is, when a cpu cgroup is running a bunch of tasks, there is no
good way to tell how its tasks are doing with respect to NUMA.

The first two patches fill in the missing pieces (a sketch of reading
the resulting statistics follows the list below), but monitoring these
statistics exposed more problems:

  * tasks do not always run on the preferred numa node
  * tasks from the same cgroup run on different nodes
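
For monitoring, patches 1 and 2 expose these numbers through
cpu.numa_stat; a minimal reader could look like the sketch below (the
cgroup path and the exact key/value layout are assumptions here, only
the file name comes from this series):

  /* sketch: dump a cgroup's cpu.numa_stat, assuming "key value" lines */
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      /* hypothetical mount point and group name; adjust as needed */
      const char *path = argc > 1 ? argv[1] :
          "/sys/fs/cgroup/cpu/example/cpu.numa_stat";
      char line[256];
      FILE *f = fopen(path, "r");

      if (!f) {
          perror("fopen");
          return 1;
      }
      while (fgets(line, sizeof(line), f))
          fputs(line, stdout);    /* one "key value" pair per line */
      fclose(f);
      return 0;
  }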

The task numa group handler checks whether tasks are sharing pages
and tries to pack them into a single numa group, so they get a chance
to settle down on the same node, but this fails in some cases:

  * workloads share page cache rather than mappings
  * workloads get too many wakeups across nodes

Since page cache is not traced by numa balancing, there is no way to
detect that kind of relationship, and when there are too many wakeups,
a task keeps getting dragged away from its preferred node and then
migrated back by numa balancing, repeatedly.
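
To illustrate, here is a standalone simulation of the grouping idea
(upstream does this in task_numa_group() in kernel/sched/fair.c,
driven by NUMA hinting faults; the code below only mimics the merge
rule): two tasks end up in one group when a hinting fault reveals a
page last touched by the other, so sharing that never takes hinting
faults, like page cache accessed via read()/write(), never merges
anyone.

  /* standalone simulation: tasks merge into one group on shared faults */
  #include <stdio.h>

  #define NTASKS 4

  static int group[NTASKS];    /* union-find parent per task */

  static int find(int t)
  {
      while (group[t] != t)
          t = group[t] = group[group[t]];    /* path halving */
      return t;
  }

  struct page { int last_task; };    /* stand-in for the page's cpupid */

  /* a "NUMA hinting fault": merge with whoever touched the page last */
  static void hint_fault(int task, struct page *pg)
  {
      if (pg->last_task >= 0)
          group[find(task)] = find(pg->last_task);
      pg->last_task = task;
  }

  int main(void)
  {
      struct page mapped = { -1 };    /* shared mapping takes hinting faults */
      int t;

      for (t = 0; t < NTASKS; t++)
          group[t] = t;

      hint_fault(0, &mapped);
      hint_fault(1, &mapped);    /* tasks 0 and 1 get grouped */

      /*
       * Tasks 2 and 3 share data only through the page cache: no
       * hinting faults ever fire for them, so they stay ungrouped.
       */
      for (t = 0; t < NTASKS; t++)
          printf("task %d -> group %d\n", t, find(t));
      return 0;
  }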

The third patch addresses the first issue: we can now give the kernel
a hint about the relationship between tasks and have them packed into
a single numa group.
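
Usage could look like the sketch below; note the knob name
cpu.numa_group and the "write 1 to enable" semantics are assumptions
taken from the patch title, see patch 3 for the actual interface:

  /* sketch: opt a cgroup's tasks into one numa group (name assumed) */
  #include <stdio.h>

  int main(void)
  {
      /* hypothetical file; patch 3 defines the real knob */
      const char *knob = "/sys/fs/cgroup/cpu/example/cpu.numa_group";
      FILE *f = fopen(knob, "w");

      if (!f) {
          perror("fopen");
          return 1;
      }
      fputs("1\n", f);    /* assumed: 1 packs the group's tasks together */
      fclose(f);
      return 0;
  }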

And the fourth patch introduces numa cling, which addresses the
wakeup issue: we now try to keep a task on its preferred node on the
wakeup fast path. To contain the imbalance risk, we monitor the numa
migration failure ratio and pause numa cling when it reaches a
specified degree.
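
The decision can be sketched in a standalone way as below; the toy
topology, the per-task bookkeeping and the 25% pause threshold are all
assumptions for illustration, patch 4 holds the real logic:

  /* standalone sketch of the numa cling wakeup decision */
  #include <stdio.h>

  struct task {
      int prev_cpu;
      int preferred_node;
      unsigned long migrate_tries, migrate_failures;
  };

  static int cpu_to_node(int cpu) { return cpu / 4; }    /* toy topology */

  /* pause clinging once failures exceed an (assumed) 25% of tries */
  static int cling_paused(struct task *p)
  {
      return p->migrate_tries &&
             p->migrate_failures * 4 > p->migrate_tries;
  }

  /* on wakeup: stay near prev_cpu when it sits on the preferred node */
  static int select_wakeup_cpu(struct task *p, int waker_cpu)
  {
      if (!cling_paused(p) &&
          cpu_to_node(p->prev_cpu) == p->preferred_node)
          return p->prev_cpu;    /* cling to the preferred node */
      return waker_cpu;          /* fall back: follow the waker */
  }

  int main(void)
  {
      struct task p = { 2, 0, 0, 0 };

      printf("wake on cpu %d\n", select_wakeup_cpu(&p, 6));    /* 2 */

      /* migration failures pile up: cling gets paused */
      p.migrate_tries = 100;
      p.migrate_failures = 40;
      printf("wake on cpu %d\n", select_wakeup_cpu(&p, 6));    /* 6 */
      return 0;
  }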

Since v1:
  * moved statistics from the memory cgroup into the cpu cgroup
  * statistics are now accounted hierarchically
  * locality is now accounted into 8 equal regions (illustrated after
    this list)
  * numa cling no longer overrides select_idle_sibling; instead we
    prevent numa swap migration for tasks clinging to the dst node,
    and also prevent wake affine from dragging away tasks that
    already cling to their prev cpu
  * other refinements to comments and names
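
As an illustration of the 8-region split, assuming locality is the
local share of NUMA page faults (patch 1 defines the exact metric):

  /* sketch: bucket a locality ratio into 8 equal regions */
  #include <stdio.h>

  /* region 0 covers [0%, 12.5%), ..., region 7 covers [87.5%, 100%] */
  static int locality_region(unsigned long local, unsigned long remote)
  {
      unsigned long total = local + remote;

      if (!total)
          return -1;                /* no samples yet */
      return local == total ? 7 : (int)(local * 8 / total);
  }

  int main(void)
  {
      printf("%d\n", locality_region(10, 90));    /* 0 */
      printf("%d\n", locality_region(50, 50));    /* 4 */
      printf("%d\n", locality_region(100, 0));    /* 7 */
      return 0;
  }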

Michael Wang (4):
  v2 numa: introduce per-cgroup numa balancing locality statistic
  v2 numa: append per-node execution time in cpu.numa_stat
  v2 numa: introduce numa group per task group
  v4 numa: introduce numa cling feature

 include/linux/sched.h        |   8 +-
 include/linux/sched/sysctl.h |   3 +
 kernel/sched/core.c          |  85 ++++++++
 kernel/sched/debug.c         |   7 +
 kernel/sched/fair.c          | 510 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |  41 ++++
 kernel/sysctl.c              |   9 +
 7 files changed, 651 insertions(+), 12 deletions(-)

-- 
2.14.4.44.g2045bb6



Thread overview: 62+ messages
2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-04-23  8:44   ` Peter Zijlstra
2019-04-23  9:14     ` 王贇
2019-04-23  8:46   ` Peter Zijlstra
2019-04-23  9:32     ` 王贇
2019-04-23  8:47   ` Peter Zijlstra
2019-04-23  9:33     ` 王贇
2019-04-23  9:46       ` Peter Zijlstra
2019-04-22  2:12 ` [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat 王贇
2019-04-23  8:52   ` Peter Zijlstra
2019-04-23  9:36     ` 王贇
2019-04-23  9:46       ` Peter Zijlstra
2019-04-23 10:01         ` 王贇
2019-04-22  2:13 ` [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node 王贇
2019-04-23  8:55   ` Peter Zijlstra
2019-04-23  9:41     ` 王贇
2019-04-22  2:14 ` [RFC PATCH 4/5] numa: introduce numa balancer infrastructure 王贇
2019-04-22  2:21 ` [RFC PATCH 5/5] numa: numa balancer 王贇
2019-04-23  9:05   ` Peter Zijlstra
2019-04-23  9:59     ` 王贇
     [not found] ` <CAHCio2gEw4xyuoiurvwzvEiU8eLas+5ZLhzmqm1V2CJqvt+cyA@mail.gmail.com>
2019-04-23  2:14   ` [RFC PATCH 0/5] NUMA Balancer Suite 王贇
2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-07-11 13:43     ` Peter Zijlstra
2019-07-12  3:15       ` 王贇
2019-07-11 13:47     ` Peter Zijlstra
2019-07-12  3:43       ` 王贇
2019-07-12  7:58         ` Peter Zijlstra
2019-07-12  9:11           ` 王贇
2019-07-12  9:42             ` Peter Zijlstra
2019-07-12 10:10               ` 王贇
2019-07-15  2:09                 ` 王贇
2019-07-15 12:10                 ` Michal Koutný
2019-07-16  2:41                   ` 王贇
2019-07-19 16:47                     ` Michal Koutný
2019-07-03  3:29   ` [PATCH 2/4] numa: append per-node execution info in memory.numa_stat 王贇
2019-07-11 13:45     ` Peter Zijlstra
2019-07-12  3:17       ` 王贇
2019-07-03  3:32   ` [PATCH 3/4] numa: introduce numa group per task group 王贇
2019-07-11 14:10     ` Peter Zijlstra
2019-07-12  4:03       ` 王贇
2019-07-03  3:34   ` [PATCH 4/4] numa: introduce numa cling feature 王贇
2019-07-08  2:25     ` [PATCH v2 " 王贇
2019-07-09  2:15       ` 王贇
2019-07-09  2:24       ` [PATCH v3 " 王贇
2019-07-11 14:27     ` [PATCH " Peter Zijlstra
2019-07-12  3:10       ` 王贇
2019-07-12  7:53         ` Peter Zijlstra
2019-07-12  8:58           ` 王贇
2019-07-22  3:44             ` 王贇
2019-07-11  9:00   ` [PATCH 0/4] per cgroup numa suite 王贇
2019-07-16  3:38   ` 王贇 [this message]
2019-07-16  3:39     ` [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic 王贇
2019-07-16  3:40     ` [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat 王贇
2019-07-19 16:39       ` Michal Koutný
2019-07-22  2:36         ` 王贇
2019-07-16  3:41     ` [PATCH v2 3/4] numa: introduce numa group per task group 王贇
2019-07-16  3:41     ` [PATCH v4 4/4] numa: introduce numa cling feature 王贇
2019-07-22  2:37       ` [PATCH v5 " 王贇
2019-07-25  2:33     ` [PATCH v2 0/4] per-cgroup numa suite 王贇
2019-08-06  1:33     ` 王贇
