All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}
@ 2022-07-08  8:21 Gang Li
  2022-07-08  8:21 ` [PATCH v2 1/5] mm: add a new parameter `node` to `get/add/inc/dec_mm_counter` Gang Li
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Gang Li @ 2022-07-08  8:21 UTC (permalink / raw)
  To: mhocko, akpm, surenb
  Cc: hca, gor, agordeev, borntraeger, svens, viro, ebiederm, keescook,
	rostedt, mingo, peterz, acme, mark.rutland, alexander.shishkin,
	jolsa, namhyung, david, imbrenda, adobriyan, yang.yang29,
	brauner, stephen.s.brennan, zhengqi.arch, haolee.swjtu, xu.xin16,
	Liam.Howlett, ohoono.kwon, peterx, arnd, shy828301, alex.sierra,
	xianting.tian, willy, ccross, vbabka, sujiaxun, sfr,
	vasily.averin, mgorman, vvghjk1234, tglx, luto, bigeasy,
	fenghua.yu, linux-s390, linux-kernel, linux-fsdevel, linux-mm,
	linux-perf-users, Gang Li

TLDR
----
If a mempolicy or cpuset is in effect, out_of_memory() will select victim
on specific node to kill. So that kernel can avoid accidental killing on
NUMA system.

Problem
-------
Before this patch series, oom will only kill the process with the highest 
memory usage by selecting process with the highest oom_badness on the
entire system.

This works fine on UMA system, but may have some accidental killing on NUMA
system.

As shown below, if process c.out is bind to Node1 and keep allocating pages
from Node1, a.out will be killed first. But killing a.out did't free any
mem on Node1, so c.out will be killed then.

A lot of AMD machines have 8 numa nodes. In these systems, there is a
greater chance of triggering this problem.

OOM before patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1    Total
----------- ---------- ------------- --------
3095 a.out     3073.34          0.11  3073.45(Killed first. Max mem usage)
3199 b.out      501.35       1500.00  2001.35
3805 c.out        1.52 (grow)2248.00  2249.52(Killed then. Node1 is full)
----------- ---------- ------------- --------
Total          3576.21       3748.11  7324.31
```

Solution
--------
We store per node rss in mm_rss_stat for each process.

If a page allocation with mempolicy or cpuset in effect triger oom. We will
calculate oom_badness with rss counter for the corresponding node. Then
select the process with the highest oom_badness on the corresponding node
to kill.

OOM after patches:
```
Per-node process memory usage (in MBs)
PID             Node 0        Node 1     Total
----------- ---------- ------------- ----------
3095 a.out     3073.34          0.11    3073.45
3199 b.out      501.35       1500.00    2001.35
3805 c.out        1.52 (grow)2248.00    2249.52(killed)
----------- ---------- ------------- ----------
Total          3576.21       3748.11    7324.31
```

Overhead
--------
CPU:

According to the result of Unixbench. There is less than one percent
performance loss in most cases.

On 40c512g machine.

40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... |   Pipe   |  Fork   | syscall |  total  |
+----------+----------+-----+----------+---------+---------+---------+
| off      | 2920.24  | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on       | 2919.15  | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead | 0.04%    | ... | -0.39%   | 2.12%   | -3.95%  | 0.28%   |
+----------+----------+-----+----------+---------+---------+---------+

1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... |  Pipe   |  Fork  | syscall |  total  |
+----------+----------+-----+---------+--------+---------+---------+
| off      | 1515.37  | ... | 1473.97 | 546.88 | 1152.37 | 1671.2  |
| on       | 1508.09  | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead | 0.48%    | ... | 0.01%   | 2.68%  | 0.31%   | 0.51%   |
+----------+----------+-----+---------+--------+---------+---------+

MEM:

per task_struct:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes

per mm_struct:
sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
typically 8 * 2 + 8 bytes

zap_pte_range:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes 

Changelog
----------
v2:
  - enable per numa node oom for `CONSTRAINT_CPUSET`.
  - add benchmark result in cover letter.

Gang Li (5):
  mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
  mm: add numa_count field for rss_stat
  mm: add numa fields for tracepoint rss_stat
  mm: enable per numa node rss_stat count
  mm, oom: enable per numa node oom for
    CONSTRAINT_{MEMORY_POLICY,CPUSET}

 arch/s390/mm/pgtable.c        |   4 +-
 fs/exec.c                     |   2 +-
 fs/proc/base.c                |   6 +-
 fs/proc/task_mmu.c            |  14 ++--
 include/linux/mm.h            |  59 ++++++++++++-----
 include/linux/mm_types_task.h |  16 +++++
 include/linux/oom.h           |   2 +-
 include/trace/events/kmem.h   |  27 ++++++--
 kernel/events/uprobes.c       |   6 +-
 kernel/fork.c                 |  70 +++++++++++++++++++-
 mm/huge_memory.c              |  13 ++--
 mm/khugepaged.c               |   4 +-
 mm/ksm.c                      |   2 +-
 mm/madvise.c                  |   2 +-
 mm/memory.c                   | 119 ++++++++++++++++++++++++----------
 mm/migrate.c                  |   4 ++
 mm/migrate_device.c           |   2 +-
 mm/oom_kill.c                 |  69 +++++++++++++++-----
 mm/rmap.c                     |  19 +++---
 mm/swapfile.c                 |   6 +-
 mm/userfaultfd.c              |   2 +-
 21 files changed, 335 insertions(+), 113 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-07-18 12:11 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-08  8:21 [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li
2022-07-08  8:21 ` [PATCH v2 1/5] mm: add a new parameter `node` to `get/add/inc/dec_mm_counter` Gang Li
2022-07-12  6:33   ` [mm] c20f7bacef: WARNING:possible_circular_locking_dependency_detected kernel test robot
2022-07-12  6:33     ` kernel test robot
2022-07-08  8:21 ` [PATCH v2 2/5] mm: add numa_count field for rss_stat Gang Li
2022-07-08 12:22   ` kernel test robot
2022-07-08  8:21 ` [PATCH v2 3/5] mm: add numa fields for tracepoint rss_stat Gang Li
2022-07-08 17:31   ` Steven Rostedt
2022-07-08  8:21 ` [PATCH v2 4/5] mm: enable per numa node rss_stat count Gang Li
2022-07-08  8:21 ` [PATCH v2 5/5] mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET} Gang Li
2022-07-08  8:54 ` [PATCH v2 0/5] mm, oom: Introduce " Michal Hocko
2022-07-08  9:25   ` Gang Li
2022-07-08  9:37     ` Michal Hocko
2022-07-12 11:12   ` Abel Wu
2022-07-12 13:35     ` Michal Hocko
2022-07-12 15:00       ` Abel Wu
2022-07-18 12:11         ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.