linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] NUMA Balancer Suite
@ 2019-04-22  2:10 王贇
  2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
                   ` (6 more replies)
  0 siblings, 7 replies; 62+ messages in thread
From: 王贇 @ 2019-04-22  2:10 UTC (permalink / raw)
  To: Peter Zijlstra, hannes, mhocko, vdavydov.dev, Ingo Molnar
  Cc: linux-kernel, linux-mm

We have NUMA Balancing feature which always trying to move pages
of a task to the node it executed more, while still got issues:

* page cache can't be handled
* no cgroup level balancing

Suppose we have a box with 4 cpu, two cgroup A & B each running 4 tasks,
below scenery could be easily observed:

NODE0			|	NODE1
			|
CPU0		CPU1	|	CPU2		CPU3
task_A0		task_A1	|	task_A2		task_A3
task_B0		task_B1	|	task_B2		task_B3

and usually with the equal memory consumption on each node, when tasks have
similar behavior.

In this case numa balancing try to move pages of task_A0,1 & task_B0,1 to node 0,
pages of task_A2,3 & task_B2,3 to node 1, but page cache will be located randomly,
depends on the first read/write CPU location.

Let's suppose another scenery:

NODE0			|	NODE1
			|
CPU0		CPU1	|	CPU2		CPU3
task_A0		task_A1	|	task_B0		task_B1
task_A2		task_A3	|	task_B2		task_B3

By switching the cpu & memory resources of task_A0,1 and task_B0,1, now workloads
of cgroup A all on node 0, and cgroup B all on node 1, resource consumption are same
but related tasks could share a closer cpu cache, while cache still randomly located.

Now what if the workloads generate lot's of page cache, and most of the memory
accessing are page cache writing?

A page cache generated by task_A0 on NODE1 won't follow it to NODE0, but if task_A0
was already on NODE0 before it read/write files, caches will be there, so how to
make sure this happen?

Usually we could solve this problem by binding workloads on a single node, if the
cgroup A was binding to CPU0,1, then all the caches it generated will be on NODE0,
the numa bonus will be maximum.

However, this require a very well administration on specified workloads, suppose in our
cases if A & B are with a changing CPU requirement from 0% to 400%, then binding to a
single node would be a bad idea.

So what we need is a way to detect memory topology on cgroup level, and try to migrate
cpu/mem resources to the node with most of the caches there, as long as the resource
is plenty on that node.

This patch set introduced:
  * advanced per-cgroup numa statistic
  * numa preferred node feature
  * Numa Balancer module

Which helps to achieve an easy and flexible numa resource assignment, to gain numa bonus
as much as possible.

Michael Wang (5):
  numa: introduce per-cgroup numa balancing locality statistic
  numa: append per-node execution info in memory.numa_stat
  numa: introduce per-cgroup preferred numa node
  numa: introduce numa balancer infrastructure
  numa: numa balancer

 drivers/Makefile             |   1 +
 drivers/numa/Makefile        |   1 +
 drivers/numa/numa_balancer.c | 715 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/memcontrol.h   |  99 ++++++
 include/linux/sched.h        |   9 +-
 kernel/sched/debug.c         |   8 +
 kernel/sched/fair.c          |  41 +++
 mm/huge_memory.c             |   7 +-
 mm/memcontrol.c              | 246 +++++++++++++++
 mm/memory.c                  |   9 +-
 mm/mempolicy.c               |   4 +
 11 files changed, 1133 insertions(+), 7 deletions(-)
 create mode 100644 drivers/numa/Makefile
 create mode 100644 drivers/numa/numa_balancer.c

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2019-08-06  1:33 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-22  2:10 [RFC PATCH 0/5] NUMA Balancer Suite 王贇
2019-04-22  2:11 ` [RFC PATCH 1/5] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-04-23  8:44   ` Peter Zijlstra
2019-04-23  9:14     ` 王贇
2019-04-23  8:46   ` Peter Zijlstra
2019-04-23  9:32     ` 王贇
2019-04-23  8:47   ` Peter Zijlstra
2019-04-23  9:33     ` 王贇
2019-04-23  9:46       ` Peter Zijlstra
2019-04-22  2:12 ` [RFC PATCH 2/5] numa: append per-node execution info in memory.numa_stat 王贇
2019-04-23  8:52   ` Peter Zijlstra
2019-04-23  9:36     ` 王贇
2019-04-23  9:46       ` Peter Zijlstra
2019-04-23 10:01         ` 王贇
2019-04-22  2:13 ` [RFC PATCH 3/5] numa: introduce per-cgroup preferred numa node 王贇
2019-04-23  8:55   ` Peter Zijlstra
2019-04-23  9:41     ` 王贇
2019-04-22  2:14 ` [RFC PATCH 4/5] numa: introduce numa balancer infrastructure 王贇
2019-04-22  2:21 ` [RFC PATCH 5/5] numa: numa balancer 王贇
2019-04-23  9:05   ` Peter Zijlstra
2019-04-23  9:59     ` 王贇
     [not found] ` <CAHCio2gEw4xyuoiurvwzvEiU8eLas+5ZLhzmqm1V2CJqvt+cyA@mail.gmail.com>
2019-04-23  2:14   ` [RFC PATCH 0/5] NUMA Balancer Suite 王贇
2019-07-03  3:26 ` [PATCH 0/4] per cpu cgroup numa suite 王贇
2019-07-03  3:28   ` [PATCH 1/4] numa: introduce per-cgroup numa balancing locality, statistic 王贇
2019-07-11 13:43     ` Peter Zijlstra
2019-07-12  3:15       ` 王贇
2019-07-11 13:47     ` Peter Zijlstra
2019-07-12  3:43       ` 王贇
2019-07-12  7:58         ` Peter Zijlstra
2019-07-12  9:11           ` 王贇
2019-07-12  9:42             ` Peter Zijlstra
2019-07-12 10:10               ` 王贇
2019-07-15  2:09                 ` 王贇
2019-07-15 12:10                 ` Michal Koutný
2019-07-16  2:41                   ` 王贇
2019-07-19 16:47                     ` Michal Koutný
2019-07-03  3:29   ` [PATCH 2/4] numa: append per-node execution info in memory.numa_stat 王贇
2019-07-11 13:45     ` Peter Zijlstra
2019-07-12  3:17       ` 王贇
2019-07-03  3:32   ` [PATCH 3/4] numa: introduce numa group per task group 王贇
2019-07-11 14:10     ` Peter Zijlstra
2019-07-12  4:03       ` 王贇
2019-07-03  3:34   ` [PATCH 4/4] numa: introduce numa cling feature 王贇
2019-07-08  2:25     ` [PATCH v2 " 王贇
2019-07-09  2:15       ` 王贇
2019-07-09  2:24       ` [PATCH v3 " 王贇
2019-07-11 14:27     ` [PATCH " Peter Zijlstra
2019-07-12  3:10       ` 王贇
2019-07-12  7:53         ` Peter Zijlstra
2019-07-12  8:58           ` 王贇
2019-07-22  3:44             ` 王贇
2019-07-11  9:00   ` [PATCH 0/4] per cgroup numa suite 王贇
2019-07-16  3:38   ` [PATCH v2 0/4] per-cgroup " 王贇
2019-07-16  3:39     ` [PATCH v2 1/4] numa: introduce per-cgroup numa balancing locality statistic 王贇
2019-07-16  3:40     ` [PATCH v2 2/4] numa: append per-node execution time in cpu.numa_stat 王贇
2019-07-19 16:39       ` Michal Koutný
2019-07-22  2:36         ` 王贇
2019-07-16  3:41     ` [PATCH v2 3/4] numa: introduce numa group per task group 王贇
2019-07-16  3:41     ` [PATCH v4 4/4] numa: introduce numa cling feature 王贇
2019-07-22  2:37       ` [PATCH v5 " 王贇
2019-07-25  2:33     ` [PATCH v2 0/4] per-cgroup numa suite 王贇
2019-08-06  1:33     ` 王贇

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).