* [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 8:13 Vasily Averin
2021-10-18 9:04 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 8:13 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, cgroups, linux-mm, linux-kernel, kernel
[-- Attachment #1: Type: text/plain, Size: 2770 bytes --]
While checking the patches that fixed broken memcg accounting in vmalloc, I
found another issue: a false global OOM triggered by a memcg-limited user-space
task.
I executed a vmalloc-eater inside a memcg-limited LXC container in a loop and
checked that it does not consume host memory beyond the assigned limit, triggers
a memcg OOM, and generates "Memory cgroup out of memory" messages. Everything
was as expected.
However, I was surprised to find quite rare global OOM messages too.
I set sysctl vm.panic_on_oom to 1, repeated the test, and successfully
crashed the node.
Dmesg showed that the global OOM was detected on a 16 GB node with ~10 GB of
free memory.
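For reference, a minimal sketch of such a reproduction setup under cgroup v1
(paths, limit, and the eater binary name are illustrative; this deliberately
escalates a global OOM into a panic, so run it only in a throwaway VM):

```shell
# Escalate any global OOM into a kernel panic (root required).
sysctl vm.panic_on_oom=1

# Create a memcg-limited group and confine the test shell to it.
mkdir /sys/fs/cgroup/memory/test
echo 512M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs

# Run the vmalloc eater in a loop: only memcg OOM kills are expected,
# never a global OOM (and thus never a panic) on a host with free memory.
while true; do ./vmalloc-eater; done
```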
syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
Call Trace:
dump_stack_lvl+0x57/0x72
dump_header+0x4a/0x2c1
out_of_memory.cold+0xa/0x7e
pagefault_out_of_memory+0x46/0x60
exc_page_fault+0x79/0x2b0
asm_exc_page_fault+0x1e/0x30
...
Mem-Info:
Node 0 DMA: 0*4kB 0*8kB <...> = 13296kB
Node 0 DMA32: 705*4kB (UM) <...> = 2586964kB
Node 0 Normal: 2743*4kB (UME) <...> = 6904828kB
...
4095866 pages RAM
...
Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
Full dmesg can be found in attached file.
How could this happen?
A user-space task inside the memcg-limited container generated a page fault;
its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
allocate the page because the memcg limit was exceeded and returned VM_FAULT_OOM.
Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
out_of_memory() without any memcg set.
In part this problem depends on one of my recent patches, which disabled
unlimited memory allocation for dying tasks. However, I think the problem can
happen for non-killed tasks too, for example because of the kmem limit.
At present do_user_addr_fault() does not know why the page allocation failed,
i.e. whether it was a global or a memcg OOM. I propose to save this information
in a new flag on task_struct. It can be set in case of memcg restrictions in
try_charge_memcg() (for the memory controller) and in obj_cgroup_charge_pages()
(for the kmem controller). Then it can be used in mem_cgroup_oom_synchronize(),
called inside pagefault_out_of_memory():
in case of memcg-related restrictions it will not trigger a fake global OOM but
will return to user space, which will retry the fault or kill the process if it
got a fatal signal.
Thank you,
Vasily Averin
Vasily Averin (1):
memcg: prevent false global OOM triggered by memcg-limited task.
include/linux/sched.h | 1 +
mm/memcontrol.c | 12 +++++++++---
2 files changed, 10 insertions(+), 3 deletions(-)
--
2.32.0
[-- Attachment #2: dmesg-oom.txt --]
[-- Type: text/plain, Size: 25650 bytes --]
[59622.176098] syz-executor invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=1000
[59622.178633] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.180840] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.182562] Call Trace:
[59622.183525] dump_stack_lvl+0x57/0x72
[59622.184782] dump_header+0x4a/0x2c1
[59622.185929] oom_kill_process.cold+0xb/0x10
[59622.187203] out_of_memory+0x229/0x5b0
[59622.188399] mem_cgroup_out_of_memory+0x111/0x130
[59622.189773] try_charge_memcg+0x693/0x720
[59622.191013] ? kvm_sched_clock_read+0x14/0x30
[59622.192318] charge_memcg+0x57/0x170
[59622.193482] mem_cgroup_swapin_charge_page+0x99/0x1d0
[59622.194932] do_swap_page+0x916/0xbf0
[59622.196110] ? __lock_acquire+0x3b3/0x1e00
[59622.197377] __handle_mm_fault+0xa5f/0x14e0
[59622.198649] ? lock_acquire+0xc4/0x2e0
[59622.199847] handle_mm_fault+0x149/0x3f0
[59622.201071] do_user_addr_fault+0x1f4/0x6c0
[59622.202346] exc_page_fault+0x79/0x2b0
[59622.203545] ? asm_exc_page_fault+0x8/0x30
[59622.204802] asm_exc_page_fault+0x1e/0x30
[59622.206043] RIP: 0033:0x41a948
[59622.207115] Code: 64 83 0c 25 08 03 00 00 10 64 48 8b 3c 25 00 03 00 00 e8 8b fe ff ff f4 66 2e 0f 1f 84 00 00 00 00 00 f7 c7 02 00 00 00 75 27 <64> 8b 04 25 08 03 00 00 41 89 c3 41 83 e3 fd f0 64 44 0f b1 1c 25
[59622.211557] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.213058] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.214919] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.216772] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.218623] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.220467] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.222367] memory: usage 524268kB, limit 524288kB, failcnt 28608
[59622.224112] memory+swap: usage 554776kB, limit 9007199254740988kB, failcnt 0
[59622.225982] kmem: usage 524168kB, limit 9007199254740988kB, failcnt 0
[59622.227740] Memory cgroup stats for /lxc.payload.test:
[59622.234613] anon 0
file 122880
kernel_stack 1294336
pagetables 3665920
percpu 490896
sock 0
shmem 0
file_mapped 12288
file_dirty 0
file_writeback 0
swapcached 49008640
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 0
active_anon 0
inactive_file 110592
active_file 0
unevictable 0
slab_reclaimable 1513744
slab_unreclaimable 14117008
slab 15630752
workingset_refault_anon 3800
workingset_refault_file 24974
workingset_activate_anon 923
workingset_activate_file 414
workingset_restore_anon 422
workingset_restore_file 190
workingset_nodereclaim 74
pgfault 120040
pgmajfault 4766
pgrefill 15999
pgscan 486263
pgsteal 44646
pgactivate 13066
pgdeactivate 14277
pglazyfree 60
pglazyfreed 34
thp_fault_alloc 0
thp_collapse_alloc 0
[59622.265206] Tasks state (memory values in pages):
[59622.266454] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[59622.268394] [ 14696] 0 14696 56226 690 90112 145 0 bash
[59622.270246] [ 14746] 0 14746 262198 0 266240 2900 0 syz-execprog
[59622.272217] [ 14764] 0 14764 12552 0 36864 16 0 syz-executor
[59622.274217] [ 14765] 0 14765 12552 0 36864 16 0 syz-executor
[59622.276165] [ 14769] 0 14769 12552 0 36864 16 0 syz-executor
[59622.278091] [ 14772] 0 14772 12552 0 36864 17 0 syz-executor
[59622.280023] [ 14775] 0 14775 12552 0 36864 17 0 syz-executor
[59622.281949] [ 14777] 0 14777 12552 0 36864 16 0 syz-executor
[59622.283847] [ 14781] 0 14781 12552 0 36864 16 0 syz-executor
[59622.285737] [ 14783] 0 14783 12552 0 36864 17 0 syz-executor
[59622.287673] [ 14787] 0 14787 12552 0 36864 16 0 syz-executor
[59622.287968] systemd-journald[648]: Compressed data object 720 -> 393 using ZSTD
[59622.289564] [ 14800] 0 14800 12552 0 36864 16 0 syz-executor
[59622.293144] [ 14803] 0 14803 12552 0 36864 16 0 syz-executor
[59622.295054] [ 14805] 0 14805 12552 0 36864 16 0 syz-executor
[59622.296928] [ 14812] 0 14812 12552 0 36864 17 0 syz-executor
[59622.298796] [ 14814] 0 14814 12552 0 36864 16 0 syz-executor
[59622.300663] [ 14817] 0 14817 12552 0 36864 17 0 syz-executor
[59622.302528] [ 14820] 0 14820 12552 0 36864 16 0 syz-executor
[59622.304399] [ 14779] 0 14779 12551 0 49152 37 0 syz-executor
[59622.306271] [ 15280] 0 15279 12584 0 69632 0 1000 syz-executor
[59622.308144] [ 14767] 0 14767 12551 0 49152 41 0 syz-executor
[59622.310046] [ 15282] 0 15281 12584 0 69632 0 1000 syz-executor
[59622.311926] [ 14766] 0 14766 12551 0 49152 44 0 syz-executor
[59622.313812] [ 15286] 0 15285 12584 0 65536 0 1000 syz-executor
[59622.315704] [ 14782] 0 14782 12551 0 49152 35 0 syz-executor
[59622.317621] [ 15284] 0 15283 12584 0 65536 0 1000 syz-executor
[59622.319570] [ 14784] 0 14784 12551 0 49152 36 0 syz-executor
[59622.321496] [ 15288] 0 15287 12617 0 65536 0 1000 syz-executor
[59622.323454] [ 14776] 0 14776 12551 0 49152 34 0 syz-executor
[59622.325374] [ 15291] 0 15289 12584 0 69632 4 1000 syz-executor
[59622.327293] [ 14813] 0 14813 12551 0 49152 33 0 syz-executor
[59622.329216] [ 15295] 0 15294 12584 0 65536 0 1000 syz-executor
[59622.331139] [ 14819] 0 14819 12551 0 49152 32 0 syz-executor
[59622.333068] [ 15293] 0 15292 12584 0 65536 0 1000 syz-executor
[59622.334991] [ 14821] 0 14821 12551 0 49152 31 0 syz-executor
[59622.336921] [ 15297] 0 15296 12584 0 65536 0 1000 syz-executor
[59622.338875] [ 14831] 0 14831 12551 0 49152 42 0 syz-executor
[59622.340811] [ 15299] 0 15298 12617 0 65536 0 1000 syz-executor
[59622.342752] [ 14804] 0 14804 12551 0 49152 38 0 syz-executor
[59622.344700] [ 15301] 0 15300 12584 0 69632 0 1000 syz-executor
[59622.346649] [ 14816] 0 14816 12551 0 49152 27 0 syz-executor
[59622.348604] [ 15306] 0 15305 12584 0 69632 0 1000 syz-executor
[59622.350615] [ 14774] 0 14774 12551 0 49152 30 0 syz-executor
[59622.352606] [ 15304] 0 15303 12584 0 65536 0 1000 syz-executor
[59622.354575] [ 14807] 0 14807 12551 0 49152 25 0 syz-executor
[59622.356531] [ 15307] 0 15307 12584 0 69632 37 1000 syz-executor
[59622.358481] [ 14801] 0 14801 12551 0 49152 23 0 syz-executor
[59622.360418] [ 15261] 0 15260 12618 0 65536 31 1000 syz-executor
[59622.362355] [ 14788] 0 14788 12551 0 49152 24 0 syz-executor
[59622.364291] [ 15263] 0 15262 12618 0 69632 0 1000 syz-executor
[59622.366269] [ 14503] 0 14503 25205 1712 241664 235 0 systemd-journal
[59622.368288] [ 14736] 0 14736 1637 398 61440 32 0 agetty
[59622.370161] [ 14737] 0 14737 1637 383 61440 31 0 agetty
[59622.372037] [ 14743] 0 14743 1637 398 61440 31 0 agetty
[59622.373939] [ 14744] 0 14744 1637 387 57344 31 0 agetty
[59622.375849] [ 14512] 0 14512 22648 1593 221184 233 0 systemd-logind
[59622.377858] [ 14513] 81 14513 13516 801 155648 138 -900 dbus-daemon
[59622.379867] [ 14734] 0 14734 52899 856 180224 440 0 rsyslogd
[59622.381813] [ 14735] 0 14735 1637 343 61440 31 0 agetty
[59622.383701] [ 14741] 0 14741 5724 572 94208 239 0 crond
[59622.385568] [ 14086] 0 14086 25345 2120 233472 520 0 systemd
[59622.387470] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=lxc.payload.test,mems_allowed=0,oom_memcg=/lxc.payload.test,task_memcg=/lxc.payload.test,task=syz-executor,pid=15307,uid=0
[59622.391402] Memory cgroup out of memory: Killed process 15307 (syz-executor) total-vm:50336kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:68kB oom_score_adj:1000
[59622.395544] oom_reaper: reaped process 15307 (syz-executor), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[59622.411181] syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
[59622.414272] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.416247] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.417950] Call Trace:
[59622.418901] dump_stack_lvl+0x57/0x72
[59622.420067] dump_header+0x4a/0x2c1
[59622.421200] out_of_memory.cold+0xa/0x7e
[59622.422415] pagefault_out_of_memory+0x46/0x60
[59622.423784] exc_page_fault+0x79/0x2b0
[59622.424966] ? asm_exc_page_fault+0x8/0x30
[59622.426207] asm_exc_page_fault+0x1e/0x30
[59622.427426] RIP: 0033:0x41a948
[59622.428485] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.430064] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.431513] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.433299] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.435090] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.436875] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.438691] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.440547] Mem-Info:
[59622.441516] active_anon:279 inactive_anon:24996 isolated_anon:0
active_file:395743 inactive_file:848017 isolated_file:0
unevictable:0 dirty:159 writeback:21
slab_reclaimable:40993 slab_unreclaimable:45236
mapped:66928 shmem:286 pagetables:2477 bounce:0
kernel_misc_reclaimable:0
free:2376272 free_pcp:7579 free_cma:0
[59622.451730] Node 0 active_anon:1116kB inactive_anon:99984kB active_file:1582972kB inactive_file:3392068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:267712kB dirty:832kB writeback:84kB shmem:1144kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:5488kB pagetables:9908kB all_unreclaimable? no
[59622.458826] Node 0 DMA free:13296kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[59622.465131] lowmem_reserve[]: 0 2706 15265 15265 15265
[59622.466746] Node 0 DMA32 free:2586964kB min:11968kB low:14960kB high:17952kB reserved_highatomic:0KB active_anon:0kB inactive_anon:20kB active_file:27084kB inactive_file:172344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:6120kB local_pcp:1004kB free_cma:0kB
[59622.473718] lowmem_reserve[]: 0 0 12559 12559 12559
[59622.475311] Node 0 Normal free:6904828kB min:55544kB low:69428kB high:83312kB reserved_highatomic:0KB active_anon:1116kB inactive_anon:99964kB active_file:1555888kB inactive_file:3219724kB unevictable:0kB writepending:916kB present:13238272kB managed:12871060kB mlocked:0kB bounce:0kB free_pcp:24068kB local_pcp:692kB free_cma:0kB
[59622.482647] lowmem_reserve[]: 0 0 0 0 0
[59622.484096] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 13296kB
[59622.488044] Node 0 DMA32: 705*4kB (UM) 624*8kB (UME) 497*16kB (UME) 594*32kB (UME) 532*64kB (UME) 49*128kB (UME) 36*256kB (UE) 36*512kB (UE) 20*1024kB (U) 17*2048kB (UE) 593*4096kB (U) = 2586964kB
[59622.492687] Node 0 Normal: 2743*4kB (UME) 2084*8kB (UME) 536*16kB (UME) 1496*32kB (UME) 1148*64kB (ME) 171*128kB (UME) 115*256kB (M) 94*512kB (M) 98*1024kB (UME) 163*2048kB (UME) 1517*4096kB (UM) = 6904828kB
[59622.497553] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[59622.499898] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[59622.502197] 1244068 total pagecache pages
[59622.503688] 4 pages in swap cache
[59622.505046] Swap cache stats: add 54921, delete 54922, find 137/9740
[59622.506956] Free swap = 8357116kB
[59622.508352] Total swap = 8388604kB
[59622.510564] 4095866 pages RAM
[59622.512379] 0 pages HighMem/MovableOnly
[59622.513853] 173881 pages reserved
[59622.515220] 0 pages cma reserved
[59622.516581] 0 pages hwpoisoned
[59622.517897] Tasks state (memory values in pages):
[59622.519524] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[59622.521895] [ 648] 0 648 71491 45126 552960 0 -250 systemd-journal
[59622.524550] [ 666] 0 666 11976 3110 102400 0 -1000 systemd-udevd
[59622.526944] [ 797] 193 797 10732 5087 122880 0 0 systemd-resolve
[59622.529352] [ 798] 0 798 26617 605 69632 0 -1000 auditd
[59622.531635] [ 800] 0 800 2077 1056 49152 0 0 sedispatch
[59622.533963] [ 821] 0 821 19874 761 57344 0 0 irqbalance
[59622.536297] [ 823] 0 823 2921 533 45056 0 0 mcelog
[59622.538590] [ 824] 996 824 666787 5873 217088 0 0 polkitd
[59622.540882] [ 825] 0 825 20273 951 53248 0 0 qemu-ga
[59622.543158] [ 826] 0 826 205022 10950 1179648 0 0 rsyslogd
[59622.545449] [ 827] 0 827 12186 2965 102400 0 0 sssd
[59622.547678] [ 828] 0 828 4502 2175 73728 0 0 systemd-homed
[59622.550047] [ 829] 0 829 4466 2118 69632 0 0 systemd-machine
[59622.552466] [ 830] 81 830 2530 1083 69632 0 -900 dbus-broker-lau
[59622.554852] [ 835] 0 835 68167 3994 139264 0 0 abrtd
[59622.557060] [ 839] 982 839 23747 967 69632 0 0 chronyd
[59622.559289] [ 840] 81 840 1525 919 49152 0 -900 dbus-broker
[59622.561576] [ 843] 0 843 12718 3366 110592 0 0 sssd_be
[59622.563803] [ 847] 0 847 177006 5860 655360 0 0 abrt-dump-journ
[59622.566207] [ 849] 0 849 174956 6978 827392 0 0 abrt-dump-journ
[59622.568518] [ 850] 0 850 15570 9663 143360 0 0 sssd_nss
[59622.570712] [ 857] 0 857 4966 2724 81920 0 0 systemd-logind
[59622.572980] [ 860] 0 860 79033 2793 118784 0 0 ModemManager
[59622.575247] [ 861] 0 861 34022 10210 155648 0 0 firewalld
[59622.577428] [ 867] 0 867 66866 5087 147456 0 0 NetworkManager
[59622.579690] [ 881] 0 881 7636 2072 73728 0 -1000 sshd
[59622.581790] [ 887] 0 887 13477 801 77824 0 0 gssproxy
[59622.583916] [ 941] 0 941 5246 715 61440 0 0 atd
[59622.585957] [ 943] 0 943 4474 892 65536 0 0 crond
[59622.588007] [ 961] 0 961 2408 455 49152 0 0 agetty
[59622.590051] [ 962] 0 962 3050 458 57344 0 0 agetty
[59622.592069] [ 1018] 991 1018 6645 533 65536 0 0 dnsmasq
[59622.594114] [ 1060] 0 1060 11399 2828 90112 0 0 sshd
[59622.596075] [ 1459] 0 1459 4387 1924 73728 0 0 systemd-userdbd
[59622.598197] [ 1488] 1000 1488 7862 3570 90112 0 0 systemd
[59622.600204] [ 1490] 1000 1490 12058 1458 102400 0 0 (sd-pam)
[59622.602218] [ 1555] 1000 1555 11399 1367 81920 0 0 sshd
[59622.604170] [ 1561] 1000 1561 5188 1443 61440 0 0 bash
[59622.606168] [ 2078] 991 2078 6645 532 61440 0 0 dnsmasq
[59622.608154] [ 2080] 0 2080 6638 100 61440 0 0 dnsmasq
[59622.610161] [ 6747] 1000 6747 10102 2101 94208 0 0 su
[59622.612074] [ 6751] 0 6751 4504 1424 65536 0 0 bash
[59622.614022] [ 14077] 0 14077 4470 1927 73728 0 0 systemd-userwor
[59622.616135] [ 14078] 0 14078 4470 1908 77824 0 0 systemd-userwor
[59622.618252] [ 14080] 0 14080 4470 1929 69632 0 0 systemd-userwor
[59622.620358] [ 14085] 0 14085 2069 642 57344 0 0 lxc-start
[59622.622379] [ 14086] 0 14086 25345 2120 233472 520 0 systemd
[59622.624401] [ 14503] 0 14503 25205 1712 241664 235 0 systemd-journal
[59622.626610] [ 14512] 0 14512 22648 1593 221184 233 0 systemd-logind
[59622.628684] [ 14513] 81 14513 13516 801 155648 138 -900 dbus-daemon
[59622.630714] [ 14694] 0 14694 2072 1104 49152 0 0 3
[59622.632591] [ 14696] 0 14696 56226 690 90112 145 0 bash
[59622.634510] [ 14734] 0 14734 52899 856 180224 440 0 rsyslogd
[59622.636496] [ 14735] 0 14735 1637 343 61440 31 0 agetty
[59622.638468] [ 14736] 0 14736 1637 398 61440 32 0 agetty
[59622.640417] [ 14737] 0 14737 1637 383 61440 31 0 agetty
[59622.642356] [ 14741] 0 14741 5724 572 94208 239 0 crond
[59622.644286] [ 14743] 0 14743 1637 398 61440 31 0 agetty
[59622.646226] [ 14744] 0 14744 1637 387 57344 31 0 agetty
[59622.648164] [ 14746] 0 14746 262198 0 266240 2900 0 syz-execprog
[59622.650200] [ 14764] 0 14764 12552 0 36864 16 0 syz-executor
[59622.652255] [ 14765] 0 14765 12552 0 36864 16 0 syz-executor
[59622.654283] [ 14766] 0 14766 12551 0 49152 44 0 syz-executor
[59622.656306] [ 14767] 0 14767 12551 0 49152 41 0 syz-executor
[59622.658318] [ 14769] 0 14769 12552 0 36864 16 0 syz-executor
[59622.660328] [ 14772] 0 14772 12552 0 36864 17 0 syz-executor
[59622.662330] [ 14774] 0 14774 12551 0 49152 30 0 syz-executor
[59622.664337] [ 14775] 0 14775 12552 0 36864 17 0 syz-executor
[59622.666456] [ 14776] 0 14776 12551 0 49152 34 0 syz-executor
[59622.668449] [ 14777] 0 14777 12552 0 36864 16 0 syz-executor
[59622.670459] [ 14779] 0 14779 12551 0 49152 37 0 syz-executor
[59622.672445] [ 14781] 0 14781 12552 0 36864 16 0 syz-executor
[59622.674449] [ 14782] 0 14782 12551 0 49152 35 0 syz-executor
[59622.676423] [ 14783] 0 14783 12552 0 36864 17 0 syz-executor
[59622.678418] [ 14784] 0 14784 12551 0 49152 36 0 syz-executor
[59622.680430] [ 14787] 0 14787 12552 0 36864 16 0 syz-executor
[59622.682398] [ 14788] 0 14788 12551 0 49152 24 0 syz-executor
[59622.684359] [ 14800] 0 14800 12552 0 36864 16 0 syz-executor
[59622.686357] [ 14801] 0 14801 12551 0 49152 23 0 syz-executor
[59622.688449] [ 14803] 0 14803 12552 0 36864 16 0 syz-executor
[59622.690434] [ 14804] 0 14804 12551 0 49152 38 0 syz-executor
[59622.692466] [ 14805] 0 14805 12552 0 36864 16 0 syz-executor
[59622.694436] [ 14807] 0 14807 12551 0 49152 25 0 syz-executor
[59622.696378] [ 14812] 0 14812 12552 0 36864 17 0 syz-executor
[59622.698313] [ 14813] 0 14813 12551 0 49152 33 0 syz-executor
[59622.700252] [ 14814] 0 14814 12552 0 36864 16 0 syz-executor
[59622.702186] [ 14816] 0 14816 12551 0 49152 27 0 syz-executor
[59622.704126] [ 14817] 0 14817 12552 0 36864 17 0 syz-executor
[59622.706053] [ 14819] 0 14819 12551 0 49152 32 0 syz-executor
[59622.707978] [ 14820] 0 14820 12552 0 36864 16 0 syz-executor
[59622.709923] [ 14821] 0 14821 12551 0 49152 31 0 syz-executor
[59622.711847] [ 14831] 0 14831 12551 0 49152 42 0 syz-executor
[59622.713773] [ 15261] 0 15260 12618 0 65536 31 1000 syz-executor
[59622.715691] [ 15263] 0 15262 12618 0 69632 0 1000 syz-executor
[59622.717604] [ 15280] 0 15279 12584 0 69632 0 1000 syz-executor
[59622.719506] [ 15282] 0 15281 12584 0 69632 0 1000 syz-executor
[59622.721401] [ 15284] 0 15283 12584 0 65536 0 1000 syz-executor
[59622.723316] [ 15286] 0 15285 12584 0 65536 0 1000 syz-executor
[59622.725243] [ 15288] 0 15287 12617 0 65536 0 1000 syz-executor
[59622.727121] [ 15291] 0 15289 12584 0 69632 4 1000 syz-executor
[59622.728993] [ 15293] 0 15292 12584 0 65536 0 1000 syz-executor
[59622.730857] [ 15295] 0 15294 12584 0 65536 0 1000 syz-executor
[59622.732722] [ 15297] 0 15296 12584 0 65536 0 1000 syz-executor
[59622.734587] [ 15299] 0 15298 12617 0 65536 0 1000 syz-executor
[59622.736441] [ 15301] 0 15300 12584 0 69632 0 1000 syz-executor
[59622.738319] [ 15304] 0 15303 12584 0 65536 0 1000 syz-executor
[59622.740174] [ 15306] 0 15305 12584 0 69632 0 1000 syz-executor
[59622.742022] [ 15307] 0 15307 12584 0 69632 7 1000 syz-executor
[59622.743869] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[59622.745603] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.747339] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.748817] Call Trace:
[59622.749557] dump_stack_lvl+0x57/0x72
[59622.750520] panic+0xff/0x2ea
[59622.751390] out_of_memory.cold+0x2f/0x7e
[59622.752413] pagefault_out_of_memory+0x46/0x60
[59622.753522] exc_page_fault+0x79/0x2b0
[59622.754504] ? asm_exc_page_fault+0x8/0x30
[59622.755549] asm_exc_page_fault+0x1e/0x30
[59622.756579] RIP: 0033:0x41a948
[59622.757441] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.758831] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.760086] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.761687] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.763287] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.764888] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.766512] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 8:13 [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
@ 2021-10-18 9:04 ` Michal Hocko
2021-10-18 10:05 ` Vasily Averin
2021-10-21 8:03 ` [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
0 siblings, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-18 9:04 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Mon 18-10-21 11:13:52, Vasily Averin wrote:
[...]
> How could this happen?
>
> A user-space task inside the memcg-limited container generated a page fault;
> its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
> allocate the page because the memcg limit was exceeded and returned VM_FAULT_OOM.
> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> out_of_memory() without any memcg set.
>
> In part this problem depends on one of my recent patches, which disabled
> unlimited memory allocation for dying tasks. However, I think the problem can
> happen for non-killed tasks too, for example because of the kmem limit.
Could you be more specific on how this can happen without your patch? I
have to say I hadn't realized this side effect when discussing it.
I will be honest that I am not really happy about pagefault_out_of_memory.
I have tried to remove it in the past. Without much success back then,
unfortunately[1].
Maybe we should get rid of it finally. The OOM is always triggered from
inside the allocator where we have much more information about the
allocation context. A first step would be to skip pagefault_out_of_memory
for killed or exiting processes.
[1] I do not have msg-id so I cannot provide a lore link but google
pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 9:04 ` Michal Hocko
@ 2021-10-18 10:05 ` Vasily Averin
2021-10-18 10:12 ` Vasily Averin
2021-10-18 11:53 ` Michal Hocko
2021-10-21 8:03 ` [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
1 sibling, 2 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 10:05 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> A user-space task inside the memcg-limited container generated a page fault;
>> its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
>> allocate the page because the memcg limit was exceeded and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>> out_of_memory() without any memcg set.
>>
>> In part this problem depends on one of my recent patches, which disabled
>> unlimited memory allocation for dying tasks. However, I think the problem can
>> happen for non-killed tasks too, for example because of the kmem limit.
>
> Could you be more specific on how this can happen without your patch? I
> have to say I hadn't realized this side effect when discussing it.
We can reach obj_cgroup_charge_pages() for example via
do_user_addr_fault
handle_mm_fault
__handle_mm_fault
p4d_alloc
__p4d_alloc
p4d_alloc_one
get_zeroed_page
__get_free_pages
alloc_pages
__alloc_pages
__memcg_kmem_charge_page
obj_cgroup_charge_pages
Here we call try_charge_memcg(), which returns success and approves the
allocation; however, we then hit the kmem limit and the allocation fails.
If required, I can try to find out how try_charge_memcg() can reject the page
allocation of a non-dying task too.
> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
> Maybe we should get rid of it finally. The OOM is always triggered from
> inside the allocator where we have much more information about the
> allocation context. A first step would be to skip pagefault_out_of_memory
> for killed or exiting processes.
I like this idea; however, it may not be enough, at least in the scenario described above.
> [1] I do not have msg-id so I cannot provide a lore link but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
Thank you, I'll read this discussion.
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 10:05 ` Vasily Averin
@ 2021-10-18 10:12 ` Vasily Averin
2021-10-18 11:53 ` Michal Hocko
1 sibling, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 10:12 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 18.10.2021 13:05, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> In part this problem depends on one of my recent patches, which disabled
>>> unlimited memory allocation for dying tasks. However, I think the problem can
>>> happen for non-killed tasks too, for example because of the kmem limit.
>>
>> Could you be more specific on how this can happen without your patch? I
>> have to say I hadn't realized this side effect when discussing it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
> do_user_addr_fault
> handle_mm_fault
> __handle_mm_fault
> p4d_alloc
> __p4d_alloc
> p4d_alloc_one
> get_zeroed_page
> __get_free_pages
> alloc_pages
> __alloc_pages
> __memcg_kmem_charge_page
> obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and the allocation fails.
Btw., in OpenVZ kernels we try to clean up memory when a task hits the kmem
limit; that is why we moved the kmem limit check into try_charge_memcg().
Would any improvements to the kmem controller be interesting for upstream?
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 10:05 ` Vasily Averin
2021-10-18 10:12 ` Vasily Averin
@ 2021-10-18 11:53 ` Michal Hocko
[not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>
2021-10-19 6:30 ` Vasily Averin
1 sibling, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-18 11:53 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> A user-space task inside the memcg-limited container generated a page fault;
> >> its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
> >> allocate the page because the memcg limit was exceeded and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> >> out_of_memory() without any memcg set.
> >>
> >> In part this problem depends on one of my recent patches, which disabled
> >> unlimited memory allocation for dying tasks. However, I think the problem can
> >> happen for non-killed tasks too, for example because of the kmem limit.
> >
> > Could you be more specific on how this can happen without your patch? I
> > have to say I hadn't realized this side effect when discussing it.
>
> We can reach obj_cgroup_charge_pages() for example via
>
> do_user_addr_fault
> handle_mm_fault
> __handle_mm_fault
> p4d_alloc
> __p4d_alloc
> p4d_alloc_one
> get_zeroed_page
> __get_free_pages
> alloc_pages
> __alloc_pages
> __memcg_kmem_charge_page
> obj_cgroup_charge_pages
>
> Here we call try_charge_memcg(), which returns success and approves the
> allocation; however, we then hit the kmem limit and the allocation fails.
Just to make sure I understand this would be for the v1 kmem explicit
limit, correct?
> If required, I can try to find out how try_charge_memcg() can reject the page
> allocation of a non-dying task too.
Yes.
> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> > Maybe we should get rid of it finally. The OOM is always triggered from
> > inside the allocator where we have much more information about the
> > allocation context. A first step would be to skip pagefault_out_of_memory
> > for killed or exiting processes.
>
> I like this idea; however, it may not be enough, at least in the scenario described above.
My original patch removed the oom killer completely.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
[not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>
@ 2021-10-18 12:27 ` Michal Hocko
2021-10-18 15:07 ` Shakeel Butt
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-18 12:27 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
[restore the cc list]
On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> On 18.10.2021 14:53, Michal Hocko wrote:
> > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >> On 18.10.2021 12:04, Michal Hocko wrote:
> >> Here we call try_charge_memcg(), which returns success and approves the allocation;
> >> however, we then hit the kmem limit and fail the allocation.
> >
> > Just to make sure I understand this would be for the v1 kmem explicit
> > limit, correct?
>
> yes, I mean this limit.
OK, thanks for the clarification. This is a known problem; have a look
at 0158115f702b ("memcg, kmem: deprecate
kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
status since 2019 without any actual report suggested by the kernel
message. Maybe we should try to remove it and see whether that prompts
some pushback.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 12:27 ` Michal Hocko
@ 2021-10-18 15:07 ` Shakeel Butt
2021-10-18 16:51 ` Michal Hocko
2021-10-18 18:52 ` Vasily Averin
0 siblings, 2 replies; 66+ messages in thread
From: Shakeel Butt @ 2021-10-18 15:07 UTC (permalink / raw)
To: Michal Hocko
Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
Cgroups, Linux MM, LKML, kernel
On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [restore the cc list]
>
> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > On 18.10.2021 14:53, Michal Hocko wrote:
> > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > >> Here we call try_charge_memcg(), which returns success and approves the allocation;
> > >> however, we then hit the kmem limit and fail the allocation.
> > >
> > > Just to make sure I understand this would be for the v1 kmem explicit
> > > limit, correct?
> >
> > yes, I mean this limit.
>
> OK, thanks for the clarification. This is a known problem; have a look
> at 0158115f702b ("memcg, kmem: deprecate
> kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
> status since 2019 without any actual report suggested by the kernel
> message. Maybe we should try to remove it and see whether that prompts
> some pushback.
>
Yes, I think now should be the right time to take the next step for
deprecation of kmem limits:
https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 15:07 ` Shakeel Butt
@ 2021-10-18 16:51 ` Michal Hocko
2021-10-18 17:13 ` Shakeel Butt
2021-10-18 18:52 ` Vasily Averin
1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-18 16:51 UTC (permalink / raw)
To: Shakeel Butt
Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
Cgroups, Linux MM, LKML, kernel
On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [restore the cc list]
> >
> > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > >> Here we call try_charge_memcg(), which returns success and approves the allocation;
> > > >> however, we then hit the kmem limit and fail the allocation.
> > > >
> > > > Just to make sure I understand this would be for the v1 kmem explicit
> > > > limit, correct?
> > >
> > > yes, I mean this limit.
> >
> > OK, thanks for the clarification. This is a known problem; have a look
> > at 0158115f702b ("memcg, kmem: deprecate
> > kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
> > status since 2019 without any actual report suggested by the kernel
> > message. Maybe we should try to remove it and see whether that prompts
> > some pushback.
> >
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
I completely forgot about your patch. Anyway, it usually takes us years
to deprecate something so let's stick with it and consider 2 years as
years ;)
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 16:51 ` Michal Hocko
@ 2021-10-18 17:13 ` Shakeel Butt
0 siblings, 0 replies; 66+ messages in thread
From: Shakeel Butt @ 2021-10-18 17:13 UTC (permalink / raw)
To: Michal Hocko
Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
Cgroups, Linux MM, LKML, kernel
On Mon, Oct 18, 2021 at 9:51 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [restore the cc list]
> > >
> > > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > > >> Here we call try_charge_memcg(), which returns success and approves the allocation;
> > > > >> however, we then hit the kmem limit and fail the allocation.
> > > > >
> > > > > Just to make sure I understand this would be for the v1 kmem explicit
> > > > > limit, correct?
> > > >
> > > > yes, I mean this limit.
> > >
> > > OK, thanks for the clarification. This is a known problem; have a look
> > > at 0158115f702b ("memcg, kmem: deprecate
> > > kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
> > > status since 2019 without any actual report suggested by the kernel
> > > message. Maybe we should try to remove it and see whether that prompts
> > > some pushback.
> > >
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> I completely forgot about your patch. Anyway, it usually takes us years
> to deprecate something so let's stick with it and consider 2 years as
> years ;)
>
Sure, I will rebase and resend.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 15:07 ` Shakeel Butt
2021-10-18 16:51 ` Michal Hocko
@ 2021-10-18 18:52 ` Vasily Averin
2021-10-18 19:18 ` Vasily Averin
2021-10-19 5:33 ` Shakeel Butt
1 sibling, 2 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 18:52 UTC (permalink / raw)
To: Shakeel Butt, Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM,
LKML, kernel
On 18.10.2021 18:07, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> [restore the cc list]
>>
>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>> Here we call try_charge_memcg(), which returns success and approves the allocation;
>>>>> however, we then hit the kmem limit and fail the allocation.
>>>>
>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>> limit, correct?
>>>
>>> yes, I mean this limit.
>>
>> OK, thanks for the clarification. This is a known problem; have a look
>> at 0158115f702b ("memcg, kmem: deprecate
>> kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
>> status since 2019 without any actual report suggested by the kernel
>> message. Maybe we should try to remove it and see whether that prompts
>> some pushback.
>>
>
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
Are you going to push it to stable kernels too?
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 18:52 ` Vasily Averin
@ 2021-10-18 19:18 ` Vasily Averin
2021-10-19 5:34 ` Shakeel Butt
2021-10-19 5:33 ` Shakeel Butt
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-18 19:18 UTC (permalink / raw)
To: Shakeel Butt, Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM,
LKML, kernel
On 18.10.2021 21:52, Vasily Averin wrote:
> On 18.10.2021 18:07, Shakeel Butt wrote:
>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>
>>> [restore the cc list]
>>>
>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>> Here we call try_charge_memcg(), which returns success and approves the allocation;
>>>>>> however, we then hit the kmem limit and fail the allocation.
>>>>>
>>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>>> limit, correct?
>>>>
>>>> yes, I mean this limit.
>>>
>>> OK, thanks for the clarification. This is a known problem; have a look
>>> at 0158115f702b ("memcg, kmem: deprecate
>>> kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
>>> status since 2019 without any actual report suggested by the kernel
>>> message. Maybe we should try to remove it and see whether that prompts
>>> some pushback.
>>>
>>
>> Yes, I think now should be the right time to take the next step for
>> deprecation of kmem limits:
>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> Are you going to push it to stable kernels too?
Btw, CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the Ubuntu 20.04 LTS kernel 5.11.0-37.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 18:52 ` Vasily Averin
2021-10-18 19:18 ` Vasily Averin
@ 2021-10-19 5:33 ` Shakeel Butt
2021-10-19 6:42 ` Vasily Averin
1 sibling, 1 reply; 66+ messages in thread
From: Shakeel Butt @ 2021-10-19 5:33 UTC (permalink / raw)
To: Vasily Averin
Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
Cgroups, Linux MM, LKML, kernel
On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 18:07, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> [restore the cc list]
> >>
> >> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>> Here we call try_charge_memcg(), which returns success and approves the allocation;
> >>>>> however, we then hit the kmem limit and fail the allocation.
> >>>>
> >>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>> limit, correct?
> >>>
> >>> yes, I mean this limit.
> >>
> >> OK, thanks for the clarification. This is a known problem; have a look
> >> at 0158115f702b ("memcg, kmem: deprecate
> >> kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
> >> status since 2019 without any actual report suggested by the kernel
> >> message. Maybe we should try to remove it and see whether that prompts
> >> some pushback.
> >>
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> Are you going to push it to stable kernels too?
>
Not really. Is there a reason I should? More exposure to catch breakage?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 19:18 ` Vasily Averin
@ 2021-10-19 5:34 ` Shakeel Butt
0 siblings, 0 replies; 66+ messages in thread
From: Shakeel Butt @ 2021-10-19 5:34 UTC (permalink / raw)
To: Vasily Averin
Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
Cgroups, Linux MM, LKML, kernel
On Mon, Oct 18, 2021 at 12:19 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 21:52, Vasily Averin wrote:
> > On 18.10.2021 18:07, Shakeel Butt wrote:
> >> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>
> >>> [restore the cc list]
> >>>
> >>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>> Here we call try_charge_memcg(), which returns success and approves the allocation;
> >>>>>> however, we then hit the kmem limit and fail the allocation.
> >>>>>
> >>>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>>> limit, correct?
> >>>>
> >>>> yes, I mean this limit.
> >>>
> >>> OK, thanks for the clarification. This is a known problem; have a look
> >>> at 0158115f702b ("memcg, kmem: deprecate
> >>> kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
> >>> status since 2019 without any actual report suggested by the kernel
> >>> message. Maybe we should try to remove it and see whether that prompts
> >>> some pushback.
> >>>
> >>
> >> Yes, I think now should be the right time to take the next step for
> >> deprecation of kmem limits:
> >> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >
> > Are you going to push it to stable kernels too?
>
> Btw CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in ubuntu 20.04 LTS kernel 5.11.0-37.
>
CONFIG_MEMCG_KMEM is orthogonal to setting kmem limits. We are not
disabling the kmem accounting.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 11:53 ` Michal Hocko
[not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>
@ 2021-10-19 6:30 ` Vasily Averin
2021-10-19 8:49 ` Michal Hocko
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-19 6:30 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 18.10.2021 14:53, Michal Hocko wrote:
> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> A user-space task inside the memcg-limited container generated a page fault;
>>>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>>>> out_of_memory() without any memcg set.
>>>>
>>>> In part this problem depends on one of my recent patches, which disabled unlimited
>>>> memory allocation for dying tasks. However, I think the problem can happen
>>>> for non-killed tasks too, for example because of the kmem limit.
>>>
>>> Could you be more specific on how this can happen without your patch? I
>>> have to say I haven't realized this side effect when discussing it.
>> If required, I can try to find how try_charge_memcg() can reject a page allocation
>> of a non-dying task too.
>
> Yes.
Now I think that such a failure was very unlikely (w/o my patch and the kmem limit).
I cannot exclude it completely, because I have not finished this review and perhaps missed something,
but I checked most of the code and found nothing.
With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
a) due to a fatal signal
b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() finds nothing)
To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
However, all these cases can be successfully handled by my new patch
"memcg: prevent false global OOM triggered by memcg limited task",
and I think it is a better solution.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 5:33 ` Shakeel Butt
@ 2021-10-19 6:42 ` Vasily Averin
2021-10-19 8:47 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-19 6:42 UTC (permalink / raw)
To: Shakeel Butt
Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
Cgroups, Linux MM, LKML, kernel
On 19.10.2021 08:33, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> On 18.10.2021 18:07, Shakeel Butt wrote:
>>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>
>>>> [restore the cc list]
>>>>
>>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>>> Here we call try_charge_memcg(), which returns success and approves the allocation;
>>>>>>> however, we then hit the kmem limit and fail the allocation.
>>>>>>
>>>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>>>> limit, correct?
>>>>>
>>>>> yes, I mean this limit.
>>>>
>>>> OK, thanks for the clarification. This is a known problem; have a look
>>>> at 0158115f702b ("memcg, kmem: deprecate
>>>> kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
>>>> status since 2019 without any actual report suggested by the kernel
>>>> message. Maybe we should try to remove it and see whether that prompts
>>>> some pushback.
>>>
>>> Yes, I think now should be the right time to take the next step for
>>> deprecation of kmem limits:
>>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>>
>> Are you going to push it to stable kernels too?
>
> Not really. Is there a reason I should? More exposure to catch breakage?
There is a problem: the kmem limit can trigger a false global OOM.
To fix it upstream you can remove the kmem controller.
However, how do we handle this problem in stable and LTS kernels?
My current patch resolves the problem too, and it can be backported.
However, I'm afraid nobody will do it if the patch is not approved upstream.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 6:42 ` Vasily Averin
@ 2021-10-19 8:47 ` Michal Hocko
0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-19 8:47 UTC (permalink / raw)
To: Vasily Averin
Cc: Shakeel Butt, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
Cgroups, Linux MM, LKML, kernel
On Tue 19-10-21 09:42:38, Vasily Averin wrote:
> On 19.10.2021 08:33, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> On 18.10.2021 18:07, Shakeel Butt wrote:
> >>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>>
> >>>> [restore the cc list]
> >>>>
> >>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>>> Here we call try_charge_memcg(), which returns success and approves the allocation;
> >>>>>>> however, we then hit the kmem limit and fail the allocation.
> >>>>>>
> >>>>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>>>> limit, correct?
> >>>>>
> >>>>> yes, I mean this limit.
> >>>>
> >>>> OK, thanks for the clarification. This is a known problem; have a look
> >>>> at 0158115f702b ("memcg, kmem: deprecate
> >>>> kmem.limit_in_bytes"). We have been reporting the deprecated and to-be-removed
> >>>> status since 2019 without any actual report suggested by the kernel
> >>>> message. Maybe we should try to remove it and see whether that prompts
> >>>> some pushback.
> >>>
> >>> Yes, I think now should be the right time to take the next step for
> >>> deprecation of kmem limits:
> >>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >>
> >> Are you going to push it to stable kernels too?
> >
> > Not really. Is there a reason I should? More exposure to catch breakage?
>
> There is a problem: kmem limit can trigger fake global OOM.
> To fix it in upstream you can remove kmem controller.
>
> However how to handle this problem in stable and LTS kernels?
I do not see any bug reports coming in, and I strongly suspect this is
because the functionality is simply not used widely enough or in the
mode where it would matter (U != 0, K < U from our documentation).
If there are relevant use cases for this setup then we really want to
hear about those, because we do not want to break userspace. Handling
that setup would be far from trivial, and the global oom killer is not the
only problem there.
> My current patch resolves the problem too, and it can be backported.
> However I afraid nobody will do it if teh patch will not be approved in upsteam.
I do not think your current approach is the right one. We do not really
want yet another flag to tell us that we are in a memcg OOM. We already have
one.
A proper way is to deal with the pagefault oom killer trigger but that
might be just too risky for stable kernels.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 6:30 ` Vasily Averin
@ 2021-10-19 8:49 ` Michal Hocko
2021-10-19 10:30 ` Vasily Averin
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-19 8:49 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Tue 19-10-21 09:30:18, Vasily Averin wrote:
[...]
> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> a) due to a fatal signal
> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() finds nothing)
>
> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
How is b) possible without current being killed? Do we allow remote
charging?
> However all these cases can be successfully handled by my new patch
> "memcg: prevent false global OOM triggered by memcg limited task"
> and I think it is better solution.
I have already replied to your approach in other email. Sorry our
replies have crossed.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 8:49 ` Michal Hocko
@ 2021-10-19 10:30 ` Vasily Averin
2021-10-19 11:54 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-19 10:30 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 19.10.2021 11:49, Michal Hocko wrote:
> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> [...]
>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>> a) due to a fatal signal
>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() finds nothing)
>>
>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> How is b) possible without current being killed? Do we allow remote
> charging?
out_of_memory for memcg_oom
select_bad_process
mem_cgroup_scan_tasks
oom_evaluate_task
oom_badness
        /*
         * Do not even consider tasks which are explicitly marked oom
         * unkillable or have been already oom reaped or the are in
         * the middle of vfork
         */
        adj = (long)p->signal->oom_score_adj;
        if (adj == OOM_SCORE_ADJ_MIN ||
            test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
            in_vfork(p)) {
                task_unlock(p);
                return LONG_MIN;
        }
Here we handle a userspace page fault, so current cannot be a kernel thread
and cannot be in_vfork().
However, the task can be marked as oom unkillable,
i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
It can be set via oom_score_adj_write().
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 10:30 ` Vasily Averin
@ 2021-10-19 11:54 ` Michal Hocko
2021-10-19 12:04 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-19 11:54 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> On 19.10.2021 11:49, Michal Hocko wrote:
> > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > [...]
> >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> >> a) due to a fatal signal
> >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() finds nothing)
> >>
> >> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
> >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>
> > How is b) possible without current being killed? Do we allow remote
> > charging?
>
> out_of_memory for memcg_oom
> select_bad_process
> mem_cgroup_scan_tasks
> oom_evaluate_task
> oom_badness
>
> /*
> * Do not even consider tasks which are explicitly marked oom
> * unkillable or have been already oom reaped or the are in
> * the middle of vfork
> */
> adj = (long)p->signal->oom_score_adj;
> if (adj == OOM_SCORE_ADJ_MIN ||
> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> in_vfork(p)) {
> task_unlock(p);
> return LONG_MIN;
> }
>
> Here we handle a userspace page fault, so current cannot be a kernel thread
> and cannot be in_vfork().
> However, the task can be marked as oom unkillable,
> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
You are right. I am not sure there is a way out of this, though. The task
can only retry forever in this case. There is nothing actionable here.
We cannot kill the task and there is no other way to release the memory.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 11:54 ` Michal Hocko
@ 2021-10-19 12:04 ` Michal Hocko
2021-10-19 13:26 ` Vasily Averin
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-19 12:04 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> > On 19.10.2021 11:49, Michal Hocko wrote:
> > > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > > [...]
> > >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> > >> a) due to a fatal signal
> > >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() finds nothing)
> > >>
> > >> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
> > >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >
> > > How is b) possible without current being killed? Do we allow remote
> > > charging?
> >
> > out_of_memory for memcg_oom
> > select_bad_process
> > mem_cgroup_scan_tasks
> > oom_evaluate_task
> > oom_badness
> >
> > /*
> > * Do not even consider tasks which are explicitly marked oom
> > * unkillable or have been already oom reaped or the are in
> > * the middle of vfork
> > */
> > adj = (long)p->signal->oom_score_adj;
> > if (adj == OOM_SCORE_ADJ_MIN ||
> > test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> > in_vfork(p)) {
> > task_unlock(p);
> > return LONG_MIN;
> > }
> >
> > Here we handle a userspace page fault, so current cannot be a kernel thread
> > and cannot be in_vfork().
> > However, the task can be marked as oom unkillable,
> > i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.
>
> You are right. I am not sure there is a way out of this though. The task
> can only retry for ever in this case. There is nothing actionable here.
> We cannot kill the task and there is no other way to release the memory.
Btw. don't we force the charge in that case?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 12:04 ` Michal Hocko
@ 2021-10-19 13:26 ` Vasily Averin
2021-10-19 14:13 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-19 13:26 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 19.10.2021 15:04, Michal Hocko wrote:
> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>> [...]
>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>>>>> a) due to fatal signal
>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>>>>
>>>>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>
>>>> How is b) possible without current being killed? Do we allow remote
>>>> charging?
>>>
>>> out_of_memory for memcg_oom
>>> select_bad_process
>>> mem_cgroup_scan_tasks
>>> oom_evaluate_task
>>> oom_badness
>>>
>>> /*
>>> * Do not even consider tasks which are explicitly marked oom
>>> * unkillable or have been already oom reaped or the are in
>>> * the middle of vfork
>>> */
>>> adj = (long)p->signal->oom_score_adj;
>>> if (adj == OOM_SCORE_ADJ_MIN ||
>>> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>> in_vfork(p)) {
>>> task_unlock(p);
>>> return LONG_MIN;
>>> }
>>>
>>> This time we handle a userspace page fault, so we cannot be a kernel
>>> thread and cannot be in_vfork().
>>> However the task can be marked as oom unkillable,
>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>>
>> You are right. I am not sure there is a way out of this though. The task
>> can only retry for ever in this case. There is nothing actionable here.
>> We cannot kill the task and there is no other way to release the memory.
>
> Btw. don't we force the charge in that case?
We should force the charge for allocations from inside the page fault handler,
to prevent an endless cycle of retried page faults.
However we should not do it for allocations from task context,
to prevent memcg-limited vmalloc-eaters from consuming all host memory.
Also I would like to return to the following hunk.
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
Now I think it's better to keep the task_is_dying() check here.
If the task is dying, there is no need to push another task to free memory.
We have already broken the vmalloc cycle, so nothing should prevent us
from returning to userspace, handling the fatal signal, exiting and freeing the memory.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
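The policy proposed in the reply above — force the charge only for page-fault-context allocations, fail cleanly in task context — can be sketched as a toy decision function (all names are hypothetical, not kernel API):

```c
#include <stdbool.h>

enum toy_outcome { TOY_CHARGED, TOY_FORCED, TOY_ENOMEM };

/* Toy model of the proposed charge policy: a failed #PF charge would
 * only retrigger the same page fault forever, so it is forced through;
 * a task-context caller gets a clean -ENOMEM instead, so a
 * memcg-limited vmalloc-eater cannot deplete host memory. */
enum toy_outcome toy_charge(bool fits_limit, bool in_user_fault)
{
	if (fits_limit)
		return TOY_CHARGED;
	return in_user_fault ? TOY_FORCED : TOY_ENOMEM;
}
```

Michal's objection below is that the two cases are not really different: the forced #PF path can overcharge just as badly as the task-context one.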
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 13:26 ` Vasily Averin
@ 2021-10-19 14:13 ` Michal Hocko
2021-10-19 14:19 ` Michal Hocko
2021-10-19 19:09 ` Vasily Averin
0 siblings, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-19 14:13 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Tue 19-10-21 16:26:50, Vasily Averin wrote:
> On 19.10.2021 15:04, Michal Hocko wrote:
> > On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> >> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> >>> On 19.10.2021 11:49, Michal Hocko wrote:
> >>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> >>>> [...]
> >>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> >>>>> a) due to fatal signal
> >>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> >>>>>
> >>>>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
> >>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> >>>
> >>>> How is b) possible without current being killed? Do we allow remote
> >>>> charging?
> >>>
> >>> out_of_memory for memcg_oom
> >>> select_bad_process
> >>> mem_cgroup_scan_tasks
> >>> oom_evaluate_task
> >>> oom_badness
> >>>
> >>> /*
> >>> * Do not even consider tasks which are explicitly marked oom
> >>> * unkillable or have been already oom reaped or the are in
> >>> * the middle of vfork
> >>> */
> >>> adj = (long)p->signal->oom_score_adj;
> >>> if (adj == OOM_SCORE_ADJ_MIN ||
> >>> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> >>> in_vfork(p)) {
> >>> task_unlock(p);
> >>> return LONG_MIN;
> >>> }
> >>>
> >>> This time we handle a userspace page fault, so we cannot be a kernel
> >>> thread and cannot be in_vfork().
> >>> However the task can be marked as oom unkillable,
> >>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> >>
> >> You are right. I am not sure there is a way out of this though. The task
> >> can only retry for ever in this case. There is nothing actionable here.
> >> We cannot kill the task and there is no other way to release the memory.
> >
> > Btw. don't we force the charge in that case?
>
> We should force the charge for allocations from inside the page fault handler,
> to prevent an endless cycle of retried page faults.
> However we should not do it for allocations from task context,
> to prevent memcg-limited vmalloc-eaters from consuming all host memory.
I don't see a big difference between those two, because the #PF could
result in the very same situation, depleting all the memory by
overcharging. A different behavior just leads to confusion and
unexpected behavior. E.g. in the past we only triggered the memcg OOM killer
from the #PF path and failed the charge otherwise. That is something
different, but it showed problems we hadn't anticipated and had
user-visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back
to the charge path").
> Also I would like to return to the following hunk.
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * A few threads which were not waiting at mutex_lock_killable() can
> * fail to bail out. Therefore, check again after holding oom_lock.
> */
> - ret = should_force_charge() || out_of_memory(&oc);
> + ret = task_is_dying() || out_of_memory(&oc);
>
> unlock:
> mutex_unlock(&oom_lock);
>
> Now I think it's better to keep the task_is_dying() check here.
> If the task is dying, there is no need to push another task to free memory.
> We have already broken the vmalloc cycle, so nothing should prevent us
> from returning to userspace, handling the fatal signal, exiting and freeing the memory.
That patch has to be discussed in its full length. There were other
details I have brought up AFAIU.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 14:13 ` Michal Hocko
@ 2021-10-19 14:19 ` Michal Hocko
2021-10-19 19:09 ` Vasily Averin
1 sibling, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-19 14:19 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Tue 19-10-21 16:13:19, Michal Hocko wrote:
[...]
> That patch has to be discussed in its full length. There were other
> details I have brought up AFAIU.
And btw. thanks for walking through that maze! It is far from trivial
with side effects all over the place which are far from obvious and heavily
underdocumented.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-19 14:13 ` Michal Hocko
2021-10-19 14:19 ` Michal Hocko
@ 2021-10-19 19:09 ` Vasily Averin
2021-10-20 8:07 ` [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-19 19:09 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 19.10.2021 17:13, Michal Hocko wrote:
> On Tue 19-10-21 16:26:50, Vasily Averin wrote:
>> On 19.10.2021 15:04, Michal Hocko wrote:
>>> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>>>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>>>> [...]
>>>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>>>>>>> a) due to fatal signal
>>>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>>>>>>
>>>>>>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory()
>>>>>>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
>>>>>
>>>>>> How is b) possible without current being killed? Do we allow remote
>>>>>> charging?
>>>>>
>>>>> out_of_memory for memcg_oom
>>>>> select_bad_process
>>>>> mem_cgroup_scan_tasks
>>>>> oom_evaluate_task
>>>>> oom_badness
>>>>>
>>>>> /*
>>>>> * Do not even consider tasks which are explicitly marked oom
>>>>> * unkillable or have been already oom reaped or the are in
>>>>> * the middle of vfork
>>>>> */
>>>>> adj = (long)p->signal->oom_score_adj;
>>>>> if (adj == OOM_SCORE_ADJ_MIN ||
>>>>> test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>>>> in_vfork(p)) {
>>>>> task_unlock(p);
>>>>> return LONG_MIN;
>>>>> }
>>>>>
>>>>> This time we handle a userspace page fault, so we cannot be a kernel
>>>>> thread and cannot be in_vfork().
>>>>> However the task can be marked as oom unkillable,
>>>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>>>>
>>>> You are right. I am not sure there is a way out of this though. The task
>>>> can only retry for ever in this case. There is nothing actionable here.
>>>> We cannot kill the task and there is no other way to release the memory.
>>>
>>> Btw. don't we force the charge in that case?
>>
>> We should force the charge for allocations from inside the page fault handler,
>> to prevent an endless cycle of retried page faults.
>> However we should not do it for allocations from task context,
>> to prevent memcg-limited vmalloc-eaters from consuming all host memory.
>
> I don't see a big difference between those two, because the #PF could
> result in the very same situation, depleting all the memory by
> overcharging. A different behavior just leads to confusion and
> unexpected behavior. E.g. in the past we only triggered the memcg OOM killer
> from the #PF path and failed the charge otherwise. That is something
> different, but it showed problems we hadn't anticipated and had
> user-visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back
> to the charge path").
In this case I think we should fail this allocation.
It's better not to allow overcharging, neither in #PF nor in regular allocations.
However this failure will trigger false global OOM in pagefault_out_of_memory(),
and we need to find some way to prevent it.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-19 19:09 ` Vasily Averin
@ 2021-10-20 8:07 ` Vasily Averin
2021-10-20 8:43 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 8:07 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This resembles a heuristic for the global OOM situation
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level so the memcg heuristic is more relaxed.
The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time.
It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.
One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.
This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.
In addition, this patch renames the should_force_charge() helper
to task_is_dying() because now its use is not associated with forced
charging.
If try_charge_memcg() is called from #PF, its new failures can force
pagefault_out_of_memory() to execute the global OOM. To prevent this,
pagefault_out_of_memory() was updated to properly handle memcg-related
restrictions.
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v4: updated pagefault_out_of_memory() to properly handle new memcg-related
restrictions and not allow false global OOM
v3: no functional changes, just improved patch description
v2: switched to the patch version proposed by mhocko@
mm/memcontrol.c | 52 ++++++++++++++++++++++++++++---------------------
mm/oom_kill.c | 3 +++
2 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..b09d3c64f63f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
iter != NULL; \
iter = mem_cgroup_iter(NULL, iter, NULL))
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
@@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
mem_cgroup_oom_notify(memcg);
mem_cgroup_unmark_under_oom(memcg);
- if (mem_cgroup_out_of_memory(memcg, mask, order))
+ if (mem_cgroup_out_of_memory(memcg, mask, order)) {
ret = OOM_SUCCESS;
- else
+ } else {
ret = OOM_FAILED;
-
+ /*
+ * In some rare cases mem_cgroup_out_of_memory() can return false.
+ * If it was called from #PF it forces handle_mm_fault()
+ * return VM_FAULT_OOM and executes pagefault_out_of_memory().
+ * memcg_in_oom is set here to notify pagefault_out_of_memory()
+ * that it was a memcg-related failure and not allow to run
+ * global OOM.
+ */
+ if (current->in_user_fault)
+ current->memcg_in_oom = (struct mem_cgroup *)ret;
+ }
if (locked)
mem_cgroup_oom_unlock(memcg);
@@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
if (!memcg)
return false;
+ /* OOM is memcg, however out_of_memory() found no victim */
+ if (memcg == (struct mem_cgroup *)OOM_FAILED) {
+ /*
+ * Should be called from pagefault_out_of_memory() only,
+ * where it is used to prevent false global OOM.
+ */
+ current->memcg_in_oom = NULL;
+ return true;
+ }
if (!handle)
goto cleanup;
@@ -2530,6 +2549,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct page_counter *counter;
enum oom_status oom_status;
unsigned long nr_reclaimed;
+ bool passed_oom = false;
bool may_swap = true;
bool drained = false;
unsigned long pflags;
@@ -2564,15 +2584,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_ATOMIC)
goto force;
- /*
- * Unlike in global OOM situations, memcg is not in a physical
- * memory shortage. Allow dying and OOM-killed tasks to
- * bypass the last charges so that they can exit quickly and
- * free their memory.
- */
- if (unlikely(should_force_charge()))
- goto force;
-
/*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2641,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_RETRY_MAYFAIL)
goto nomem;
- if (fatal_signal_pending(current))
- goto force;
+ /* Avoid endless loop for tasks bypassed by the oom killer */
+ if (passed_oom && task_is_dying())
+ goto nomem;
/*
* keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2652,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
get_order(nr_pages * PAGE_SIZE));
- switch (oom_status) {
- case OOM_SUCCESS:
+ if (oom_status == OOM_SUCCESS) {
+ passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
- case OOM_FAILED:
- goto force;
- default:
- goto nomem;
}
nomem:
if (!(gfp_mask & __GFP_NOFAIL))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;
+ if (fatal_signal_pending(current))
+ return;
+
if (!mutex_trylock(&oom_lock))
return;
out_of_memory(&oc);
--
2.32.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
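The retry logic the v4 patch above introduces in try_charge_memcg() can be modelled as a small stand-alone sketch. Reclaim retries are collapsed into a simple failure counter, and all struct and function names are illustrative, not kernel code:

```c
#include <stdbool.h>

/* Toy model of the patched try_charge_memcg() loop: after the memcg
 * OOM killer has run at least once (passed_oom), a dying task no
 * longer forces the charge but fails with -ENOMEM; if the OOM killer
 * finds no victim (OOM_FAILED), the charge also fails instead of
 * being forced. */
struct toy_state {
	int charges_left_to_fail;   /* charge fits after this many failures */
	bool oom_kill_finds_victim; /* models OOM_SUCCESS vs OOM_FAILED */
	bool task_dying;            /* models task_is_dying() */
};

int toy_try_charge(struct toy_state *s)
{
	bool passed_oom = false;

	for (;;) {
		if (s->charges_left_to_fail == 0)
			return 0;           /* charge fits under the limit */
		s->charges_left_to_fail--;
		/* avoid endless loop for tasks bypassed by the oom killer */
		if (passed_oom && s->task_dying)
			return -1;          /* -ENOMEM */
		if (s->oom_kill_finds_victim) {
			passed_oom = true;  /* OOM_SUCCESS: retry the charge */
			continue;
		}
		return -1;                  /* OOM_FAILED: no forcing anymore */
	}
}
```

As in the real code, a live task keeps retrying as long as the OOM killer makes forward progress; only the dying and no-victim cases now fail.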
* Re: [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-20 8:07 ` [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
@ 2021-10-20 8:43 ` Michal Hocko
2021-10-20 12:11 ` [PATCH memcg RFC 0/3] " Vasily Averin
[not found] ` <cover.1634730787.git.vvs@virtuozzo.com>
0 siblings, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 8:43 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Wed 20-10-21 11:07:02, Vasily Averin wrote:
[...]
I haven't read through the changelog and only focused on the patch this
time.
[...]
> @@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
> mem_cgroup_oom_notify(memcg);
>
> mem_cgroup_unmark_under_oom(memcg);
> - if (mem_cgroup_out_of_memory(memcg, mask, order))
> + if (mem_cgroup_out_of_memory(memcg, mask, order)) {
> ret = OOM_SUCCESS;
> - else
> + } else {
> ret = OOM_FAILED;
> -
> + /*
> + * In some rare cases mem_cgroup_out_of_memory() can return false.
> + * If it was called from #PF it forces handle_mm_fault()
> + * return VM_FAULT_OOM and executes pagefault_out_of_memory().
> + * memcg_in_oom is set here to notify pagefault_out_of_memory()
> + * that it was a memcg-related failure and not allow to run
> + * global OOM.
> + */
> + if (current->in_user_fault)
> + current->memcg_in_oom = (struct mem_cgroup *)ret;
> + }
> if (locked)
> mem_cgroup_oom_unlock(memcg);
>
> @@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
> if (!memcg)
> return false;
>
> + /* OOM is memcg, however out_of_memory() found no victim */
> + if (memcg == (struct mem_cgroup *)OOM_FAILED) {
> + /*
> + * Should be called from pagefault_out_of_memory() only,
> + * where it is used to prevent false global OOM.
> + */
> + current->memcg_in_oom = NULL;
> + return true;
> + }
> if (!handle)
> goto cleanup;
I have to say I am not a great fan of this but this belongs to a
separate patch on top of all the previous ones.
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..1deef8c7a71b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
> if (mem_cgroup_oom_synchronize(true))
> return;
>
> + if (fatal_signal_pending(current))
> + return;
> +
This belongs to its own patch as well.
All that being said I would go with pagefault_out_of_memory as the first
patch because it is necessary to handle charge failures. Then go with a
patch to remove charge forcing when OOM killer succeeds but the retry
still fails and finally go with one that tries to handle oom failures.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH memcg RFC 0/3] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-20 8:43 ` Michal Hocko
@ 2021-10-20 12:11 ` Vasily Averin
[not found] ` <cover.1634730787.git.vvs@virtuozzo.com>
1 sibling, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 12:11 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
Dear Michal,
as you requested, I have split the v4 patch into 3 separate parts.
Let's discuss each of them.
Open questions/ToDo:
- Update patch descriptions (and add some comments)
- in part 2 aka "memcg: remove charge forcinig for dying tasks":
should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
- in part 3 aka "memcg: handle memcg oom failures"
what is the best way to notify pagefault_out_of_memory() about
mem_cgroup_out_of_memory failure ?
- what is the best way to handle a memcg failure due to the kmem limit,
which can trigger a false global OOM
Vasily Averin (3):
mm: do not firce global OOM from inside dying tasks
memcg: remove charge forcinig for dying tasks
memcg: handle memcg oom failures
mm/memcontrol.c | 52 ++++++++++++++++++++++++++++---------------------
mm/oom_kill.c | 3 +++
2 files changed, 33 insertions(+), 22 deletions(-)
--
2.32.0
^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks
[not found] ` <cover.1634730787.git.vvs@virtuozzo.com>
@ 2021-10-20 12:12 ` Vasily Averin
2021-10-20 12:33 ` Michal Hocko
2021-10-20 12:13 ` [PATCH memcg 2/3] memcg: remove charge forcinig for " Vasily Averin
2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 12:12 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
There is no sense in forcing a global OOM if the current task is dying.
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
mm/oom_kill.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;
+ if (fatal_signal_pending(current))
+ return;
+
if (!mutex_trylock(&oom_lock))
return;
out_of_memory(&oc);
--
2.32.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
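The control flow that patch 1/3 gives pagefault_out_of_memory() can be modelled as a toy function; the enum and parameter names below are illustrative:

```c
#include <stdbool.h>

/* Toy model of pagefault_out_of_memory() after patch 1/3: a #PF OOM is
 * first handed to the memcg synchronization; failing that, a task with
 * a fatal signal pending bails out instead of invoking the global OOM
 * killer; only otherwise does the global out_of_memory() run. */
enum toy_path { TOY_MEMCG_HANDLED, TOY_DYING_BAILOUT, TOY_GLOBAL_OOM };

enum toy_path toy_pagefault_oom(bool memcg_synchronized, bool fatal_signal)
{
	if (memcg_synchronized)
		return TOY_MEMCG_HANDLED;   /* mem_cgroup_oom_synchronize(true) */
	if (fatal_signal)
		return TOY_DYING_BAILOUT;   /* new early return from the patch */
	return TOY_GLOBAL_OOM;              /* out_of_memory(&oc) */
}
```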
* [PATCH memcg 2/3] memcg: remove charge forcinig for dying tasks
[not found] ` <cover.1634730787.git.vvs@virtuozzo.com>
2021-10-20 12:12 ` [PATCH memcg 1/3] mm: do not firce global OOM from inside " Vasily Averin
@ 2021-10-20 12:13 ` Vasily Averin
2021-10-20 12:41 ` Michal Hocko
2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 12:13 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
mm/memcontrol.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..74a7379dbac1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
iter != NULL; \
iter = mem_cgroup_iter(NULL, iter, NULL))
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct page_counter *counter;
enum oom_status oom_status;
unsigned long nr_reclaimed;
+ bool passed_oom = false;
bool may_swap = true;
bool drained = false;
unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_ATOMIC)
goto force;
- /*
- * Unlike in global OOM situations, memcg is not in a physical
- * memory shortage. Allow dying and OOM-killed tasks to
- * bypass the last charges so that they can exit quickly and
- * free their memory.
- */
- if (unlikely(should_force_charge()))
- goto force;
-
/*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_RETRY_MAYFAIL)
goto nomem;
- if (fatal_signal_pending(current))
- goto force;
+ /* Avoid endless loop for tasks bypassed by the oom killer */
+ if (passed_oom && task_is_dying())
+ goto nomem;
/*
* keep retrying as long as the memcg oom killer is able to make
@@ -2642,6 +2635,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
get_order(nr_pages * PAGE_SIZE));
switch (oom_status) {
case OOM_SUCCESS:
+ passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
case OOM_FAILED:
--
2.32.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH memcg 3/3] memcg: handle memcg oom failures
[not found] ` <cover.1634730787.git.vvs@virtuozzo.com>
2021-10-20 12:12 ` [PATCH memcg 1/3] mm: do not firce global OOM from inside " Vasily Averin
2021-10-20 12:13 ` [PATCH memcg 2/3] memcg: remove charge forcinig for " Vasily Averin
@ 2021-10-20 12:14 ` Vasily Averin
2021-10-20 13:02 ` Michal Hocko
2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 12:14 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
mem_cgroup_oom() can fail if the current task was marked unkillable
and the OOM killer cannot find any victim.
Currently we force the memcg charge for such allocations;
however, this allows a memcg-limited userspace task to overuse its assigned
limits and potentially trigger global memory shortage.
Let's fail the memory charge in such cases.
This failure should be somehow recognised in #PF context,
so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
ToDo: what is the best way to notify pagefault_out_of_memory() about
mem_cgroup_out_of_memory failure ?
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
mm/memcontrol.c | 32 +++++++++++++++++++++++---------
1 file changed, 23 insertions(+), 9 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a7379dbac1..b09d3c64f63f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
mem_cgroup_oom_notify(memcg);
mem_cgroup_unmark_under_oom(memcg);
- if (mem_cgroup_out_of_memory(memcg, mask, order))
+ if (mem_cgroup_out_of_memory(memcg, mask, order)) {
ret = OOM_SUCCESS;
- else
+ } else {
ret = OOM_FAILED;
-
+ /*
+ * In some rare cases mem_cgroup_out_of_memory() can return false.
+ * If it was called from #PF it forces handle_mm_fault()
+ * return VM_FAULT_OOM and executes pagefault_out_of_memory().
+ * memcg_in_oom is set here to notify pagefault_out_of_memory()
+ * that it was a memcg-related failure and not allow to run
+ * global OOM.
+ */
+ if (current->in_user_fault)
+ current->memcg_in_oom = (struct mem_cgroup *)ret;
+ }
if (locked)
mem_cgroup_oom_unlock(memcg);
@@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
if (!memcg)
return false;
+ /* OOM is memcg, however out_of_memory() found no victim */
+ if (memcg == (struct mem_cgroup *)OOM_FAILED) {
+ /*
+ * Should be called from pagefault_out_of_memory() only,
+ * where it is used to prevent false global OOM.
+ */
+ current->memcg_in_oom = NULL;
+ return true;
+ }
if (!handle)
goto cleanup;
@@ -2633,15 +2652,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
get_order(nr_pages * PAGE_SIZE));
- switch (oom_status) {
- case OOM_SUCCESS:
+ if (oom_status == OOM_SUCCESS) {
passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
- case OOM_FAILED:
- goto force;
- default:
- goto nomem;
}
nomem:
if (!(gfp_mask & __GFP_NOFAIL))
--
2.32.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
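The sentinel-pointer notification used by patch 3/3 — storing the small OOM_FAILED enum value cast to a struct mem_cgroup pointer in current->memcg_in_oom, and recognising it in mem_cgroup_oom_synchronize() — can be sketched as a toy model (all names are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_oom_status { TOY_OOM_SUCCESS = 1, TOY_OOM_FAILED = 2 };

struct toy_memcg;                       /* opaque; never dereferenced */
static struct toy_memcg *toy_memcg_in_oom; /* models current->memcg_in_oom */

/* Models the OOM_FAILED branch of mem_cgroup_oom() when running in a
 * user page fault: the enum value is cast to a pointer, as in the
 * patch, so the #PF path can tell "memcg failed" from "no memcg OOM". */
static void toy_note_oom_failed_in_fault(void)
{
	toy_memcg_in_oom = (struct toy_memcg *)(unsigned long)TOY_OOM_FAILED;
}

/* Models mem_cgroup_oom_synchronize(): returning true means "the memcg
 * handled it, do not run the global OOM killer". */
static bool toy_oom_synchronize(void)
{
	if (toy_memcg_in_oom == (struct toy_memcg *)(unsigned long)TOY_OOM_FAILED) {
		toy_memcg_in_oom = NULL;  /* consume the sentinel */
		return true;              /* suppress the false global OOM */
	}
	return false;
}
```

Michal's earlier reply already hints he is not a fan of this encoding, which is why the cover letter lists the notification mechanism as an open question.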
* Re: [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks
2021-10-20 12:12 ` [PATCH memcg 1/3] mm: do not firce global OOM from inside " Vasily Averin
@ 2021-10-20 12:33 ` Michal Hocko
2021-10-20 13:52 ` Vasily Averin
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 12:33 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
s@firce@force@
On Wed 20-10-21 15:12:19, Vasily Averin wrote:
> There is no sense in forcing a global OOM if the current task is dying.
This really begs for much more information. Feel free to get an
inspiration from my previous attempt to achieve something similar.
At a minimum it is important to mention that the OOM killer is already
handled at the page allocator level for the global OOM and at the
charging level for the memcg one. Both have much more information
about the scope of allocation/charge request. This means that either the
OOM killer has been invoked properly and didn't lead to the allocation
success or it has been skipped because it couldn't have been invoked.
In both cases triggering it from here is pointless and even harmful.
Another argument is that it is more reasonable to let a killed task die
rather than hit the oom killer and retry the allocation.
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> mm/oom_kill.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..1deef8c7a71b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
> if (mem_cgroup_oom_synchronize(true))
> return;
>
> + if (fatal_signal_pending(current))
> + return;
> +
> if (!mutex_trylock(&oom_lock))
> return;
> out_of_memory(&oc);
> --
> 2.32.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 2/3] memcg: remove charge forcinig for dying tasks
2021-10-20 12:13 ` [PATCH memcg 2/3] memcg: remove charge forcinig for " Vasily Averin
@ 2021-10-20 12:41 ` Michal Hocko
2021-10-20 14:21 ` Vasily Averin
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 12:41 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Wed 20-10-21 15:13:46, Vasily Averin wrote:
> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
> mm/memcontrol.c | 20 +++++++-------------
> 1 file changed, 7 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..74a7379dbac1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
> iter != NULL; \
> iter = mem_cgroup_iter(NULL, iter, NULL))
>
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
> {
> return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> (current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * A few threads which were not waiting at mutex_lock_killable() can
> * fail to bail out. Therefore, check again after holding oom_lock.
> */
> - ret = should_force_charge() || out_of_memory(&oc);
> + ret = task_is_dying() || out_of_memory(&oc);
Why are you keeping the task_is_dying check here? IIRC I have already
pointed out that out_of_memory already has some means to do a bypass
when needed.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
@ 2021-10-20 13:02 ` Michal Hocko
2021-10-20 15:46 ` Vasily Averin
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 13:02 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> mem_cgroup_oom() can fail if current task was marked unkillable
> and oom killer cannot find any victim.
>
> Currently we force the memcg charge for such allocations;
> however, this allows a memcg-limited userspace task to overuse assigned limits
> and potentially trigger a global memory shortage.
You should really go into more details whether that is a practical
problem to handle. OOM_FAILED means that the memcg oom killer couldn't
find any oom victim so it cannot help with a forward progress. There are
not that many situations when that can happen. Naming that would be
really useful.
> Let's fail the memory charge in such cases.
>
> This failure should be somehow recognised in #PF context,
explain why
> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
>
> ToDo: what is the best way to notify pagefault_out_of_memory() about
> mem_cgroup_out_of_memory failure ?
why don't you simply remove out_of_memory from pagefault_out_of_memory
and leave it only with the blocking memcg OOM handling? Wouldn't that be a
more generic solution? Your first patch already goes that way partially.
This change is more risky than the first one. If somebody returns
VM_FAULT_OOM without invoking the allocator then it can loop forever, but
invoking the OOM killer in such a situation is equally wrong, as the oom
killer cannot really help, right?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks
2021-10-20 12:33 ` Michal Hocko
@ 2021-10-20 13:52 ` Vasily Averin
0 siblings, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 13:52 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On 20.10.2021 15:33, Michal Hocko wrote:
> s@firce@force@
>
> On Wed 20-10-21 15:12:19, Vasily Averin wrote:
>> There is no sense to force global OOM if current task is dying.
>
> This really begs for much more information. Feel free to get an
> inspiration from my previous attempt to achieve something similar.
> In minimum it is important to mention that the OOM killer is already
> handled at the page allocator level for the global OOM and at the
> charging level for the memcg one. Both have much more information
> about the scope of allocation/charge request. This means that either the
> OOM killer has been invoked properly and didn't lead to the allocation
> success or it has been skipped because it couldn't have been invoked.
> In both cases triggering it from here is pointless and even harmful.
>
> Another argument is that it is more reasonable to let killed task die
> rather than hit the oom killer and retry the allocation.
Thank you,
I'll update the patch description later;
for now I would like to clarify the patch content.
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>> mm/oom_kill.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 831340e7ad8b..1deef8c7a71b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
>> if (mem_cgroup_oom_synchronize(true))
>> return;
>>
>> + if (fatal_signal_pending(current))
>> + return;
>> +
>> if (!mutex_trylock(&oom_lock))
>> return;
>> out_of_memory(&oc);
>> --
>> 2.32.0
>
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 2/3] memcg: remove charge forcinig for dying tasks
2021-10-20 12:41 ` Michal Hocko
@ 2021-10-20 14:21 ` Vasily Averin
2021-10-20 14:57 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 14:21 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On 20.10.2021 15:41, Michal Hocko wrote:
> On Wed 20-10-21 15:13:46, Vasily Averin wrote:
>> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>> mm/memcontrol.c | 20 +++++++-------------
>> 1 file changed, 7 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6da5020a8656..74a7379dbac1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -239,7 +239,7 @@ enum res_type {
>> iter != NULL; \
>> iter = mem_cgroup_iter(NULL, iter, NULL))
>>
>> -static inline bool should_force_charge(void)
>> +static inline bool task_is_dying(void)
>> {
>> return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>> (current->flags & PF_EXITING);
>> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>> * A few threads which were not waiting at mutex_lock_killable() can
>> * fail to bail out. Therefore, check again after holding oom_lock.
>> */
>> - ret = should_force_charge() || out_of_memory(&oc);
>> + ret = task_is_dying() || out_of_memory(&oc);
>
> Why are you keeping the task_is_dying check here? IIRC I have already
> pointed out that out_of_memory already has some means to do a bypass
> when needed.
It was a misunderstanding.
I've been waiting for your final decision.
I have no good arguments "pro" or strong objection "contra".
However, I prefer to keep task_is_dying() so as not to touch other tasks unnecessarily.
I can't justify why its removal is necessary,
but on the other hand, it shouldn't break anything.
I can drop it if you think it's necessary.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 2/3] memcg: remove charge forcinig for dying tasks
2021-10-20 14:21 ` Vasily Averin
@ 2021-10-20 14:57 ` Michal Hocko
2021-10-20 15:20 ` Tetsuo Handa
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-20 14:57 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Wed 20-10-21 17:21:33, Vasily Averin wrote:
> On 20.10.2021 15:41, Michal Hocko wrote:
> > On Wed 20-10-21 15:13:46, Vasily Averin wrote:
> >> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
> >>
> >> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> >> ---
> >> mm/memcontrol.c | 20 +++++++-------------
> >> 1 file changed, 7 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6da5020a8656..74a7379dbac1 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -239,7 +239,7 @@ enum res_type {
> >> iter != NULL; \
> >> iter = mem_cgroup_iter(NULL, iter, NULL))
> >>
> >> -static inline bool should_force_charge(void)
> >> +static inline bool task_is_dying(void)
> >> {
> >> return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> >> (current->flags & PF_EXITING);
> >> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >> * A few threads which were not waiting at mutex_lock_killable() can
> >> * fail to bail out. Therefore, check again after holding oom_lock.
> >> */
> >> - ret = should_force_charge() || out_of_memory(&oc);
> >> + ret = task_is_dying() || out_of_memory(&oc);
> >
> > Why are you keeping the task_is_dying check here? IIRC I have already
> > pointed out that out_of_memory already has some means to do a bypass
> > when needed.
>
> It was a misunderstanding.
Sorry if I confused you.
> I've been waiting for your final decision.
>
> I have no good arguments "pro" or strong objection "contra".
> However, I prefer to keep task_is_dying() so as not to touch other tasks unnecessarily.
One argument for removing it from here is the maintainability. Now you
have a memcg specific check which is not in sync with the oom. E.g.
out_of_memory does task_will_free_mem as the very first thing. You are
also automatically excluding the OOM killer for cases where that might make
sense.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 2/3] memcg: remove charge forcinig for dying tasks
2021-10-20 14:57 ` Michal Hocko
@ 2021-10-20 15:20 ` Tetsuo Handa
2021-10-21 10:03 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Tetsuo Handa @ 2021-10-20 15:20 UTC (permalink / raw)
To: Michal Hocko, Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 2021/10/20 23:57, Michal Hocko wrote:
> One argument for removing it from here is the maintainability. Now you
> have a memcg specific check which is not in sync with the oom. E.g.
> out_of_memory does task_will_free_mem as the very first thing. You are
> also automatically excluding the OOM killer for cases where that might make
> sense.
What makes it possible to remove this check?
This check was added because task_will_free_mem() in out_of_memory() does NOT work.
See commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer").
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
2021-10-20 13:02 ` Michal Hocko
@ 2021-10-20 15:46 ` Vasily Averin
2021-10-21 11:49 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-20 15:46 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On 20.10.2021 16:02, Michal Hocko wrote:
> On Wed 20-10-21 15:14:27, Vasily Averin wrote:
>> mem_cgroup_oom() can fail if current task was marked unkillable
>> and oom killer cannot find any victim.
>>
>> Currently we force the memcg charge for such allocations;
>> however, this allows a memcg-limited userspace task to overuse assigned limits
>> and potentially trigger a global memory shortage.
>
> You should really go into more details whether that is a practical
> problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> find any oom victim so it cannot help with a forward progress. There are
> not that many situations when that can happen. Naming that would be
> really useful.
I pointed this out above:
"if current task was marked unkillable and oom killer cannot find any victim."
This may happen when the current task cannot be oom-killed because it is marked
unkillable, i.e. it has p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN,
and the other processes in the memcg are either dying, are kernel threads, or are marked
unkillable in the same way. Or when the memcg contains this process only.
If we always approve this kind of allocation, it can be misused:
a process can mmap a lot of memory
and then touch it, generating page faults and forcing overcharged memory allocations.
Eventually it can consume all node memory and trigger a global memory shortage on the host.
>> Let's fail the memory charge in such cases.
>>
>> This failure should be somehow recognised in #PF context,
>
> explain why
When a #PF cannot allocate memory (for the reason described above), handle_mm_fault() returns VM_FAULT_OOM,
and its caller executes pagefault_out_of_memory(). If the latter cannot recognize the real reason for the failure,
it assumes a global memory shortage and executes the global out_of_memory(), which can kill a random process
or even crash the node if sysctl vm.panic_on_oom is set to 1.
Currently pagefault_out_of_memory() knows about a possible async memcg OOM and handles it correctly.
However, it is not aware that the memcg can reject some other allocations, does not recognize the fault
as memcg-related, and allows a global OOM to run.
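The decision that pagefault_out_of_memory() has to make can be modeled as a small userspace function. This is a hedged sketch: the names task_model, pf_oom_decide, and the MEMCG_OOM_FAILED sentinel value are invented for illustration; the real logic lives in mm/oom_kill.c and mm/memcontrol.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the checks discussed in this thread. */
enum pf_oom_action {
    PF_RETURN_TO_USER,          /* just retry the fault, no global OOM */
    PF_GLOBAL_OOM,              /* fall through to out_of_memory() */
};

struct task_model {
    void *memcg_in_oom;         /* non-NULL: a memcg OOM is pending/failed */
    bool fatal_signal_pending;  /* task already has a fatal signal queued */
};

/* Sentinel mirroring the proposal to encode OOM_FAILED in the pointer. */
#define MEMCG_OOM_FAILED ((void *)16)

static enum pf_oom_action pf_oom_decide(const struct task_model *t)
{
    /* A memcg OOM (pending or failed) is never a global memory
     * shortage, so do not disturb the rest of the system. */
    if (t->memcg_in_oom)
        return PF_RETURN_TO_USER;
    /* Patch 1/3: a dying task will release its memory shortly. */
    if (t->fatal_signal_pending)
        return PF_RETURN_TO_USER;
    return PF_GLOBAL_OOM;
}
```

Only when neither condition holds would the fault handler still reach the global OOM killer, which is the case the rest of the thread argues about removing.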
>> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
>>
>> ToDo: what is the best way to notify pagefault_out_of_memory() about
>> mem_cgroup_out_of_memory failure ?
>
> why don't you simply remove out_of_memory from pagefault_out_of_memory
> and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> more generic solution? Your first patch already goes that way partially.
I clearly understand that a global out_of_memory should not be triggered by memcg restrictions.
I also understand that a dying task will release some memory soon, so we need not run a global OOM if the current task is dying.
However, I'm not sure that I can remove out_of_memory() entirely; at least I do not have good arguments for it.
> This change is more risky than the first one. If somebody returns
> VM_FAULT_OOM without invoking the allocator then it can loop forever, but
> invoking the OOM killer in such a situation is equally wrong, as the oom
> killer cannot really help, right?
I'm not ready to comment on this right now; let me take some time to think about it.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-18 9:04 ` Michal Hocko
2021-10-18 10:05 ` Vasily Averin
@ 2021-10-21 8:03 ` Vasily Averin
2021-10-21 11:49 ` Michal Hocko
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-21 8:03 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> User-space task inside the memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>> out_of_memory() without a memcg set.
> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1].
>
> [1] I do not have msg-id so I cannot provide a lore link but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
I re-read this discussion and in general I support your position.
As far as I understand, your opponents could not explain why the "random kill" is mandatory here;
they are just afraid that it might be useful and do not want to remove it completely.
OK, let's allow that; moreover, I'm ready to keep it as the default behavior.
However, I would like to have some choice at this point.
In general we can:
- continue to use the "random kill" and rely on the wisdom of the ancestors.
- do nothing, repeat the #PF and rely on fate: "nothing bad will happen if we do it again".
- add some (progressive) killable delay, relying on the good will of (unkillable) neighbors, and wait for them to release the required memory.
- mark the current task as cycling in the #PF and somehow use this mark in the allocator.
- make sure that the current task is really cycling with no progress, then send it a fatal signal to kill it and break the cycle.
- implement any better ideas,
- use any combination of the previous points.
We could select the required strategy, for example via sysctl.
For me the "random kill" is the worst choice.
Why can't we just kill the looping process instead?
It can be marked as oom-unkillable, so that the OOM killer is unable to select it.
However, I doubt that means "never kill it"; to me it is more like a "last possible victim" priority.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 2/3] memcg: remove charge forcinig for dying tasks
2021-10-20 15:20 ` Tetsuo Handa
@ 2021-10-21 10:03 ` Michal Hocko
0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-21 10:03 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, cgroups, linux-mm, linux-kernel, kernel
On Thu 21-10-21 00:20:02, Tetsuo Handa wrote:
> On 2021/10/20 23:57, Michal Hocko wrote:
> > One argument for removing it from here is the maintainability. Now you
> > have a memcg specific check which is not in sync with the oom. E.g.
> > out_of_memory does task_will_free_mem as the very first thing. You are
> > also automatically excluding oom killer for cases where that might make
> > a sense.
>
> What makes it possible to remove this check?
> This check was added because task_will_free_mem() in out_of_memory() does NOT work.
> See commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer").
You are right. I'd forgotten about this and should have checked git blame.
Thanks for bringing this up!
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-21 8:03 ` [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
@ 2021-10-21 11:49 ` Michal Hocko
2021-10-21 13:24 ` Vasily Averin
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-21 11:49 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On Thu 21-10-21 11:03:43, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> User-space task inside the memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
> >> out_of_memory() without a memcg set.
>
> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1].
> >
> > [1] I do not have msg-id so I cannot provide a lore link but google
> > pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
>
> I re-read this discussion and in general I support your position.
> As far as I understand your opponents cannot explain why "random kill" is mandatory here,
> they are just afraid that it might be useful here and do not want to remove it completely.
That aligns with my recollection.
> OK, let's allow that; moreover, I'm ready to keep it as the default behavior.
>
> However I would like to have some choice in this point.
>
> In general we can:
> - continue to use "random kill" and rely on the wisdom of the ancestors.
I do not follow. Does that mean preserving the existing oom killer invocation from
the #PF?
> - do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again".
> - add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory.
Again, not really sure what you mean
> - mark the current task as cycled in #PF and somehow use this mark in allocator
How?
> - make sure that the current task is really cycled, have no progress, send him fatal signal to kill it and break the cycle.
No! We cannot really kill the task; if we could, we would have done it via
the oom killer already.
> - implement any better ideas,
> - use any combination of previous points
>
> We can select required strategy for example via sysctl.
Absolutely not! How can an admin know any better than the kernel?
> For me "random kill" is worst choice,
> Why can't we just kill the looped process instead?
See above.
> It can be marked as oom-unkillable, so OOM-killer was unable to select it.
> However I doubt it means "never kill it", for me it is something like "last possible victim" priority.
It means never kill it because of OOM. If it is retrying because of OOM
then it is effectively the same thing.
The oom killer from the #PF doesn't really provide any clear advantage
these days AFAIK. On the other hand it allows for very disruptive
behavior. In the worst case it can lead to a system panic if the
VM_FAULT_OOM is not really caused by a memory shortage but rather by
wrong error handling. If a task is looping there without any progress
then it is still killable, which is a much saner behavior IMHO.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
2021-10-20 15:46 ` Vasily Averin
@ 2021-10-21 11:49 ` Michal Hocko
2021-10-21 15:05 ` Vasily Averin
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-21 11:49 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Wed 20-10-21 18:46:56, Vasily Averin wrote:
> On 20.10.2021 16:02, Michal Hocko wrote:
> > On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> >> mem_cgroup_oom() can fail if current task was marked unkillable
> >> and oom killer cannot find any victim.
> >>
> >> Currently we force the memcg charge for such allocations;
> >> however, this allows a memcg-limited userspace task to overuse assigned limits
> >> and potentially trigger a global memory shortage.
> >
> > You should really go into more details whether that is a practical
> > problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> > find any oom victim so it cannot help with a forward progress. There are
> > not that many situations when that can happen. Naming that would be
> > really useful.
>
> I pointed this out above:
> "if current task was marked unkillable and oom killer cannot find any victim."
> This may happen when the current task cannot be oom-killed because it is marked
> unkillable, i.e. it has p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN,
> and the other processes in the memcg are either dying, are kernel threads, or are marked
> unkillable in the same way. Or when the memcg contains this process only.
>
> If we always approve this kind of allocation, it can be misused:
> a process can mmap a lot of memory
> and then touch it, generating page faults and forcing overcharged memory allocations.
> Eventually it can consume all node memory and trigger a global memory shortage on the host.
Yes, this is true, but a) OOM_SCORE_ADJ_MIN tasks are excluded from the
OOM handling, so they have to be careful with their memory consumption, and
b) is this a theoretical or a practical concern?
This is mostly what I wanted to make sure you describe in the changelog.
> >> Let's fail the memory charge in such cases.
> >>
> >> This failure should be somehow recognised in #PF context,
> >
> > explain why
>
> When a #PF cannot allocate memory (for the reason described above), handle_mm_fault() returns VM_FAULT_OOM,
> and its caller executes pagefault_out_of_memory(). If the latter cannot recognize the real reason for the failure,
> it assumes a global memory shortage and executes the global out_of_memory(), which can kill a random process
> or even crash the node if sysctl vm.panic_on_oom is set to 1.
>
> Currently pagefault_out_of_memory() knows about a possible async memcg OOM and handles it correctly.
> However, it is not aware that the memcg can reject some other allocations, does not recognize the fault
> as memcg-related, and allows a global OOM to run.
Again something to be added to the changelog.
> >> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
> >>
> >> ToDo: what is the best way to notify pagefault_out_of_memory() about
> >> mem_cgroup_out_of_memory failure ?
> >
> > why don't you simply remove out_of_memory from pagefault_out_of_memory
> > and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> > more generic solution? Your first patch already goes that way partially.
>
> I clearly understand that a global out_of_memory should not be triggered by memcg restrictions.
> I also understand that a dying task will release some memory soon, so we need not run a global OOM if the current task is dying.
>
> However, I'm not sure that I can remove out_of_memory() entirely; at least I do not have good arguments for it.
I do understand that handling a very specific case sounds easier, but it
would be better to have a robust fix even if that requires some more
head scratching. So far we have collected several reasons why it is
bad to trigger the oom killer from the #PF path. There is no single argument
to keep it, so it sounds like a viable path to pursue. Maybe there are
some very well hidden reasons, but those should be documented, and this is
a great opportunity to do so.
Moreover if it turns out that there is a regression then this can be
easily reverted and a different, maybe memcg specific, solution can be
implemented.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
2021-10-21 11:49 ` Michal Hocko
@ 2021-10-21 13:24 ` Vasily Averin
0 siblings, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-21 13:24 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
cgroups, linux-mm, linux-kernel, kernel
On 21.10.2021 14:49, Michal Hocko wrote:
> On Thu 21-10-21 11:03:43, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> User-space task inside the memcg-limited container generated a page fault,
>>>> its handler do_user_addr_fault() called handle_mm_fault which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>>>> out_of_memory() without a memcg set.
>>
>>> I will be honest that I am not really happy about pagefault_out_of_memory.
>>> I have tried to remove it in the past. Without much success back then,
>>> unfortunately[1].
>>>
>>> [1] I do not have msg-id so I cannot provide a lore link but google
>>> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
>>
>> I re-read this discussion and in general I support your position.
>> As far as I understand your opponents cannot explain why "random kill" is mandatory here,
>> they are just afraid that it might be useful here and do not want to remove it completely.
>
> That aligns with my recollection.
>
>> OK, let's allow that; moreover, I'm ready to keep it as the default behavior.
>>
>> However I would like to have some choice in this point.
>>
>> In general we can:
>> - continue to use "random kill" and rely on the wisdom of the ancestors.
>
> I do not follow. Does that mean to preserve existing oom killer from
> #PF?
>
>> - do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again".
>> - add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory.
>
> Again, not really sure what you mean
>
>> - mark the current task as cycled in #PF and somehow use this mark in allocator
>
> How?
>
>> - make sure that the current task is really cycled, have no progress, send him fatal signal to kill it and break the cycle.
>
> No! We cannot really kill the task; if we could, we would have done it via
> the oom killer already.
>
>> - implement any better ideas,
>> - use any combination of previous points
>>
>> We can select required strategy for example via sysctl.
>
> Absolutely not! How can an admin know any better than the kernel?
>
>> For me "random kill" is worst choice,
>> Why can't we just kill the looped process instead?
>
> See above.
>
>> It can be marked as oom-unkillable, so OOM-killer was unable to select it.
>> However I doubt it means "never kill it", for me it is something like "last possible victim" priority.
>
> It means never kill it because of OOM. If it is retrying because of OOM
> then it is effectively the same thing.
>
> The oom killer from the #PF doesn't really provide any clear advantage
> these days AFAIK. On the other hand it allows for a very disruptive
> behavior. In a worst case it can lead to a system panic if the
> VM_FAULT_OOM is not really caused by a memory shortage but rather a
> wrong error handling. If a task is looping there without any progress
> then it is still kilallable which is a much saner behavior IMHO.
Let's continue this discussion in "Re: [PATCH memcg 3/3] memcg: handle memcg oom failures" thread.
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
2021-10-21 11:49 ` Michal Hocko
@ 2021-10-21 15:05 ` Vasily Averin
2021-10-21 16:47 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-21 15:05 UTC (permalink / raw)
To: Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On 21.10.2021 14:49, Michal Hocko wrote:
> I do understand that handling a very specific case sounds easier but it
> would be better to have a robust fix even if that requires some more
head scratching. So far we have collected several reasons why it is
> bad to trigger oom killer from the #PF path. There is no single argument
> to keep it so it sounds like a viable path to pursue. Maybe there are
> some very well hidden reasons but those should be documented and this is
a great opportunity to do either of the steps.
>
> Moreover if it turns out that there is a regression then this can be
> easily reverted and a different, maybe memcg specific, solution can be
> implemented.
Now I agree,
however I still have a few open questions.
1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory(),
for example it can be caused by incorrect vm fault operations,
(a) which can return this error without calling the allocator at all,
(b) or which can provide incorrect gfp flags so the allocator fails without executing out_of_memory(),
(c) This may happen on stable/LTS kernels when an otherwise successful allocation fails by hitting the limit of the legacy memcg-kmem controller.
We'll drop it in upstream kernels, however how do we handle it in old kernels?
We can make sure that out_of_memory() or the allocator was called by setting some per-task flags.
Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?
If not, the task will loop.
It is much better than triggering a global OOM, however it would be even better to avoid it somehow.
You said: "We cannot really kill the task; if we could, we would have done it by the oom killer already".
However, what should we do if we never even tried to use the OOM killer? (see (b) and (c))
or if we did not use the allocator at all (see (a))
2) In your patch we just exit from pagefault_out_of_memory() and restart the #PF.
We can call schedule_timeout() and wait some time before a new #PF restart.
Additionally we can increase this delay in each new cycle.
It helps to save CPU time for other tasks.
What do you think about this?
Thank you,
Vasily Averin
* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
2021-10-21 15:05 ` Vasily Averin
@ 2021-10-21 16:47 ` Michal Hocko
2021-10-22 8:10 ` [PATCH memcg v2 0/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
[not found] ` <cover.1634889066.git.vvs@virtuozzo.com>
0 siblings, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-21 16:47 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Thu 21-10-21 18:05:28, Vasily Averin wrote:
> On 21.10.2021 14:49, Michal Hocko wrote:
> > I do understand that handling a very specific case sounds easier but it
> > would be better to have a robust fix even if that requires some more
> head scratching. So far we have collected several reasons why it is
> > bad to trigger oom killer from the #PF path. There is no single argument
> > to keep it so it sounds like a viable path to pursue. Maybe there are
> > some very well hidden reasons but those should be documented and this is
> a great opportunity to do either of the steps.
> >
> > Moreover if it turns out that there is a regression then this can be
> > easily reverted and a different, maybe memcg specific, solution can be
> > implemented.
>
> Now I agree,
> however I still have a few open questions.
>
> 1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory()
> for example it can be caused by incorrect vm fault operations,
> (a) which can return this error without calling the allocator at all,
I would argue this to be a bug. How can that particular code tell
whether the system is OOM and the oom killer is a reasonable measure
to take?
> (b) or which can provide incorrect gfp flags so the allocator fails without executing out_of_memory(),
I am not sure I can see any sensible scenario where pagefault oom killer
would be an appropriate fix for that.
> (c) This may happen on stable/LTS kernels when an otherwise successful allocation fails by hitting the limit of the legacy memcg-kmem controller.
> We'll drop it in upstream kernels, however how do we handle it in old kernels?
Triggering the global oom killer for legacy kmem charge failure is
clearly wrong. Removing oom killer from #PF would fix that problem.
> We can make sure that out_of_memory() or the allocator was called by setting some per-task flags.
I am not sure I see how that would be useful other than reporting a
dubious VM_FAULT_OOM usage. I am also not sure how that would be
implemented as the allocator can be called several times, not to mention that
the allocation itself could have been done from a different context -
e.g. WQ.
> Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?
In principle it can as sending signal is not prohibited. I would argue
it should not though because it is just wrong thing to do in all those
cases.
> If not, the task will loop.
Yes, but it will be killable from userspace. So this is not an
unrecoverable situation.
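The looping-but-killable behavior discussed here can be illustrated with a toy userspace model (this is not kernel code; the struct and function names below are invented for illustration only): the task simply retries the failing fault until a fatal signal arrives, at which point it bails out and exits without any OOM-killer involvement.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy userspace model of the behavior described above: a task whose page
 * fault keeps failing with VM_FAULT_OOM just restarts the fault, and the
 * loop is broken only once a fatal signal is delivered from userspace.
 * All names here are illustrative assumptions, not kernel symbols.
 */
struct task {
    bool fatal_signal_pending;   /* set when e.g. SIGKILL is delivered */
    unsigned int retries;        /* how many times the #PF was restarted */
};

/* Returns the number of restarts performed before the loop was broken. */
static unsigned int fault_retry_loop(struct task *t, unsigned int kill_after)
{
    for (;;) {
        if (t->retries == kill_after)
            t->fatal_signal_pending = true;  /* userspace sends SIGKILL */
        if (t->fatal_signal_pending)
            return t->retries;  /* bail out: task exits, no global OOM */
        /* handle_mm_fault() failed with VM_FAULT_OOM again: restart */
        t->retries++;
    }
}
```

The point of the sketch is that the loop is bounded by an external kill, so the situation stays recoverable from userspace.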
> It is much better than triggering a global OOM, however it would be even better to avoid it somehow.
How?
> You said: "We cannot really kill the task; if we could, we would have done it by the oom killer already".
> However, what should we do if we never even tried to use the OOM killer? (see (b) and (c))
> or if we did not use the allocator at all (see (a))
See above
> 2) In your patch we just exit from pagefault_out_of_memory() and restart the #PF.
> We can call schedule_timeout() and wait some time before a new #PF restart.
> Additionally we can increase this delay in each new cycle.
> It helps to save CPU time for other tasks.
> What do you think about this?
I do not have a strong opinion on this. A short sleep makes sense. I am
not sure a more complex implementation is really needed.
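The increasing-delay idea proposed above could look roughly like the following userspace sketch (the function name and constants are hypothetical; nothing like this appears in the posted patches): double a short initial delay on each consecutive #PF restart, capped so a task never sleeps for too long.

```c
#include <assert.h>

/*
 * Hypothetical model (not kernel code) of a backoff policy between
 * consecutive VM_FAULT_OOM restarts: start with a short delay and
 * double it on every retry, capped at a maximum.  The names and
 * constants below are illustrative assumptions only.
 */
#define PFOOM_DELAY_MIN_MS 1
#define PFOOM_DELAY_MAX_MS 100

/* Delay (ms) to apply before the n-th consecutive #PF restart. */
static unsigned int pfoom_backoff_ms(unsigned int retries)
{
    unsigned int delay = PFOOM_DELAY_MIN_MS;

    while (retries-- > 0 && delay < PFOOM_DELAY_MAX_MS)
        delay *= 2;             /* exponential growth per cycle */
    return delay < PFOOM_DELAY_MAX_MS ? delay : PFOOM_DELAY_MAX_MS;
}
```

In the kernel the computed value would feed something like schedule_timeout_killable(), so the sleeping task remains killable while yielding the CPU to other tasks.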
--
Michal Hocko
SUSE Labs
* [PATCH memcg v2 0/2] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-21 16:47 ` Michal Hocko
@ 2021-10-22 8:10 ` Vasily Averin
[not found] ` <cover.1634889066.git.vvs@virtuozzo.com>
1 sibling, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-22 8:10 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It can be misused to trigger a global OOM from inside a
memcg-limited container. On the other hand, if memcg fails an allocation
called from inside the #PF handler, it triggers a global OOM from inside
pagefault_out_of_memory().
To prevent these problems this patch set:
1) removes execution of out_of_memory() from pagefault_out_of_memory(),
because nobody can explain why it is necessary.
2) allows memcg to fail the allocations of dying/killed tasks.
v2: resplit,
use old patch from Michal Hocko removing out_of_memory() from
pagefault_out_of_memory()
Michal Hocko (1):
mm, oom: do not trigger out_of_memory from the #PF
Vasily Averin (1):
memcg: prohibit unconditional exceeding the limit of dying tasks
mm/memcontrol.c | 27 ++++++++-------------------
mm/oom_kill.c | 23 ++++++++++-------------
2 files changed, 18 insertions(+), 32 deletions(-)
--
2.32.0
* [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF
[not found] ` <cover.1634889066.git.vvs@virtuozzo.com>
@ 2021-10-22 8:11 ` Vasily Averin
2021-10-22 8:55 ` Michal Hocko
2021-10-22 8:11 ` [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-22 8:11 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
From: Michal Hocko <mhocko@suse.com>
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for
2 different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.
The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has been already invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.
Moreover if the allocation failed and the OOM killer hasn't been
invoked then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restrictions and
therefore might oom kill a task from a different NUMA domain.
An allocation might fail also when the current task is the oom victim
and there are no memory reserves left and we should simply bail out
from the #PF rather than invoking out_of_memory.
This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.
[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into
a real disaster. It was broken a long time ago, most likely since
6a1a0d3b625a ("mm: allocate kernel pages to the right memcg").
In upstream the problem will be fixed by removing the outdated kmem limit,
however stable and LTS kernels cannot do it and are still affected.
This patch fixes the problem and should be backported into stable.]
Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg")
Cc: stable@vger.kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
---
mm/oom_kill.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..f98954befafb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc)
}
/*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
*/
void pagefault_out_of_memory(void)
{
- struct oom_control oc = {
- .zonelist = NULL,
- .nodemask = NULL,
- .memcg = NULL,
- .gfp_mask = 0,
- .order = 0,
- };
+ static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
if (mem_cgroup_oom_synchronize(true))
return;
- if (!mutex_trylock(&oom_lock))
+ if (fatal_signal_pending(current))
return;
- out_of_memory(&oc);
- mutex_unlock(&oom_lock);
+
+ if (__ratelimit(&pfoom_rs))
+ pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
}
SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
--
2.32.0
* [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks
[not found] ` <cover.1634889066.git.vvs@virtuozzo.com>
2021-10-22 8:11 ` [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
@ 2021-10-22 8:11 ` Vasily Averin
2021-10-22 9:10 ` Michal Hocko
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-22 8:11 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This is resembling a heuristic for the global OOM situation
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level so the memcg heuristic is more relieved.
The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time.
It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.
One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.
This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.
In addition, this patch renames should_force_charge() helper
to task_is_dying() because now its use is not associated with forced
charging.
Fixes: a636b327f731 ("memcg: avoid unnecessary system-wide-oom-killer")
Cc: stable@vger.kernel.org
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
mm/memcontrol.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..87e41c3cac10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
iter != NULL; \
iter = mem_cgroup_iter(NULL, iter, NULL))
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct page_counter *counter;
enum oom_status oom_status;
unsigned long nr_reclaimed;
+ bool passed_oom = false;
bool may_swap = true;
bool drained = false;
unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_ATOMIC)
goto force;
- /*
- * Unlike in global OOM situations, memcg is not in a physical
- * memory shortage. Allow dying and OOM-killed tasks to
- * bypass the last charges so that they can exit quickly and
- * free their memory.
- */
- if (unlikely(should_force_charge()))
- goto force;
-
/*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_RETRY_MAYFAIL)
goto nomem;
- if (fatal_signal_pending(current))
- goto force;
+ /* Avoid endless loop for tasks bypassed by the oom killer */
+ if (passed_oom && task_is_dying())
+ goto nomem;
/*
* keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
get_order(nr_pages * PAGE_SIZE));
- switch (oom_status) {
- case OOM_SUCCESS:
+ if (oom_status == OOM_SUCCESS) {
+ passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
- case OOM_FAILED:
- goto force;
- default:
- goto nomem;
}
nomem:
if (!(gfp_mask & __GFP_NOFAIL))
--
2.32.0
* Re: [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF
2021-10-22 8:11 ` [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
@ 2021-10-22 8:55 ` Michal Hocko
0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-22 8:55 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Fri 22-10-21 11:11:09, Vasily Averin wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
>
> The latter is quite problematic because allocation paths already trigger
> out_of_memory and the page allocator tries really hard to not fail
> allocations. Anyway, if the OOM killer has been already invoked there
> is no reason to invoke it again from the #PF path. Especially when the
> OOM condition might be gone by that time and we have no way to find out
> other than allocate.
>
> Moreover if the allocation failed and the OOM killer hasn't been
> invoked then we are unlikely to do the right thing from the #PF context
> because we have already lost the allocation context and restrictions and
> therefore might oom kill a task from a different NUMA domain.
>
> An allocation might fail also when the current task is the oom victim
> and there are no memory reserves left and we should simply bail out
> from the #PF rather than invoking out_of_memory.
>
> This all suggests that there is no legitimate reason to trigger
> out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> that no #PF path returns with VM_FAULT_OOM without allocation print a
> warning that this is happening before we restart the #PF.
Thanks for reviving my old patch! I still do think this is the right
direction. I would argue that we should split this into two stages as
already mentioned. First to add a bail out on fatal_signal_pending
because that is a trivial thing to do and it should be obviously safer.
Then we can remove out_of_memory completely. That should be the
obviously right thing to do, but if it turns out we have missed
something we would need to revert that part.
> [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
> This is a local problem related to memcg, however, it causes unnecessary
> global OOM kills that are repeated over and over again and escalate into
> a real disaster. It was broken a long time ago, most likely since
> 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg").
> In upstream the problem will be fixed by removing the outdated kmem limit,
> however stable and LTS kernels cannot do it and are still affected.
> This patch fixes the problem and should be backported into stable.]
>
> Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg")
I am not sure the sha1 is the right one but the issue is there since
kmem accounting has been introduced so it should be around that time.
Maybe do not name the specific commit but stick to something like
"This has been broken since kmem accounting has been introduced for
cgroup v1 (3.8). There was no kmem specific reclaim for the separate
limit so the only way to handle kmem hard limit was to return with
ENOMEM."
> Cc: stable@vger.kernel.org
I am still not sure this really needs a stable backport. There are no
bug reports for years so I strongly suspect people are simply not using
kmem hard limit with cgroup v1.
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
Btw. you should be adding your s-o-b here.
> ---
> mm/oom_kill.c | 23 ++++++++++-------------
> 1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..f98954befafb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc)
> }
>
> /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
> */
> void pagefault_out_of_memory(void)
> {
> - struct oom_control oc = {
> - .zonelist = NULL,
> - .nodemask = NULL,
> - .memcg = NULL,
> - .gfp_mask = 0,
> - .order = 0,
> - };
> + static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
>
> if (mem_cgroup_oom_synchronize(true))
> return;
>
> - if (!mutex_trylock(&oom_lock))
> + if (fatal_signal_pending(current))
> return;
> - out_of_memory(&oc);
> - mutex_unlock(&oom_lock);
> +
> + if (__ratelimit(&pfoom_rs))
> + pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
> }
>
> SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> --
> 2.32.0
--
Michal Hocko
SUSE Labs
* Re: [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-22 8:11 ` [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
@ 2021-10-22 9:10 ` Michal Hocko
2021-10-23 13:18 ` [PATCH memcg v3 0/3] " Vasily Averin
[not found] ` <cover.1634994605.git.vvs@virtuozzo.com>
0 siblings, 2 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-22 9:10 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Fri 22-10-21 11:11:29, Vasily Averin wrote:
> Memory cgroup charging allows killed or exiting tasks to exceed the hard
> limit. It is assumed that the amount of the memory charged by those
> tasks is bound and most of the memory will get released while the task
> is exiting. This is resembling a heuristic for the global OOM situation
> when tasks get access to memory reserves. There is no global memory
> shortage at the memcg level so the memcg heuristic is more relieved.
>
> The above assumption is overly optimistic though. E.g. vmalloc can scale
> to really large requests and the heuristic would allow that. We used to
> have an early break in the vmalloc allocator for killed tasks but this
> has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed""). There are likely other similar code paths
> which do not check for fatal signals in an allocation&charge loop.
> Also there are some kernel objects charged to a memcg which are not
> bound to a process life time.
>
> It has been observed that it is not really hard to trigger these
> bypasses and cause global OOM situation.
>
> One potential way to address these runaways would be to limit the amount
> of excess (similar to the global OOM with limited oom reserves). This is
> certainly possible but it is not really clear how much of an excess is
> desirable and still protects from global OOMs as that would have to
> consider the overall memcg configuration.
>
> This patch is addressing the problem by removing the heuristic
> altogether. Bypass is only allowed for requests which either cannot fail
> or where the failure is not desirable while excess should be still
> limited (e.g. atomic requests). Implementation wise a killed or dying
> task fails to charge if it has passed the OOM killer stage. That should
> give all forms of reclaim chance to restore the limit before the
> failure (ENOMEM) and tell the caller to back off.
>
> In addition, this patch renames should_force_charge() helper
> to task_is_dying() because now its use is not associated with forced
> charging.
I would explicitly mention that this depends on pagefault_out_of_memory
to not trigger out_of_memory because then a memcg failure can unwind to
VM_FAULT_OOM and cause a global OOM killer.
Maybe it would be even better to fold the removal to this patch so the
dependency is more obvious. I will leave that to you.
> Fixes: a636b327f731 ("memcg: avoid unnecessary system-wide-oom-killer")
Fixes tag would be quite hard here. For example you certainly didn't
have a practically unbound vector to go over the hard limit - like
vmalloc. At least not after __GFP_ACCOUNT has been introduced. So I
would just not bother with a Fixes tag at all rather than have it raise
more questions than answers.
> Cc: stable@vger.kernel.org
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Other than that looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/memcontrol.c | 27 ++++++++-------------------
> 1 file changed, 8 insertions(+), 19 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..87e41c3cac10 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
> iter != NULL; \
> iter = mem_cgroup_iter(NULL, iter, NULL))
>
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
> {
> return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> (current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * A few threads which were not waiting at mutex_lock_killable() can
> * fail to bail out. Therefore, check again after holding oom_lock.
> */
> - ret = should_force_charge() || out_of_memory(&oc);
> + ret = task_is_dying() || out_of_memory(&oc);
>
> unlock:
> mutex_unlock(&oom_lock);
> @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> struct page_counter *counter;
> enum oom_status oom_status;
> unsigned long nr_reclaimed;
> + bool passed_oom = false;
> bool may_swap = true;
> bool drained = false;
> unsigned long pflags;
> @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (gfp_mask & __GFP_ATOMIC)
> goto force;
>
> - /*
> - * Unlike in global OOM situations, memcg is not in a physical
> - * memory shortage. Allow dying and OOM-killed tasks to
> - * bypass the last charges so that they can exit quickly and
> - * free their memory.
> - */
> - if (unlikely(should_force_charge()))
> - goto force;
> -
> /*
> * Prevent unbounded recursion when reclaim operations need to
> * allocate memory. This might exceed the limits temporarily,
> @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (gfp_mask & __GFP_RETRY_MAYFAIL)
> goto nomem;
>
> - if (fatal_signal_pending(current))
> - goto force;
> + /* Avoid endless loop for tasks bypassed by the oom killer */
> + if (passed_oom && task_is_dying())
> + goto nomem;
>
> /*
> * keep retrying as long as the memcg oom killer is able to make
> @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
> get_order(nr_pages * PAGE_SIZE));
> - switch (oom_status) {
> - case OOM_SUCCESS:
> + if (oom_status == OOM_SUCCESS) {
> + passed_oom = true;
> nr_retries = MAX_RECLAIM_RETRIES;
> goto retry;
> - case OOM_FAILED:
> - goto force;
> - default:
> - goto nomem;
> }
> nomem:
> if (!(gfp_mask & __GFP_NOFAIL))
> --
> 2.32.0
--
Michal Hocko
SUSE Labs
* [PATCH memcg v3 0/3] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-22 9:10 ` Michal Hocko
@ 2021-10-23 13:18 ` Vasily Averin
[not found] ` <cover.1634994605.git.vvs@virtuozzo.com>
1 sibling, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-23 13:18 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It can be misused to trigger a global OOM from inside a
memcg-limited container. On the other hand, if memcg fails an allocation
called from inside the #PF handler, it triggers a global OOM from inside
pagefault_out_of_memory().
To prevent these problems this patch set:
a) removes execution of out_of_memory() from pagefault_out_of_memory(),
because nobody can explain why it is necessary.
b) allows memcg to fail the allocations of dying/killed tasks.
v3: resplit, improved patch descriptions
v2: resplit,
use old patch from Michal Hocko removing out_of_memory() from
pagefault_out_of_memory()
Michal Hocko (1):
mm, oom: do not trigger out_of_memory from the #PF
Vasily Averin (2):
mm, oom: pagefault_out_of_memory: don't force global OOM for dying
tasks
memcg: prohibit unconditional exceeding the limit of dying tasks
mm/memcontrol.c | 27 ++++++++-------------------
mm/oom_kill.c | 23 ++++++++++-------------
2 files changed, 18 insertions(+), 32 deletions(-)
--
2.32.0
* [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
[not found] ` <cover.1634994605.git.vvs@virtuozzo.com>
@ 2021-10-23 13:19 ` Vasily Averin
2021-10-25 9:27 ` Michal Hocko
2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
2021-10-23 13:20 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-23 13:19 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
Any allocation failure during the #PF path will return with VM_FAULT_OOM,
which in turn results in pagefault_out_of_memory(), which in its turn
executes out_of_memory() and can kill a random task.
An allocation might fail when the current task is the oom victim
and there are no memory reserves left. The OOM killer is already
handled at the page allocator level for the global OOM and at the
charging level for the memcg one. Both have much more information
about the scope of allocation/charge request. This means that
either the OOM killer has been invoked properly and didn't lead
to the allocation success or it has been skipped because it couldn't
have been invoked. In both cases triggering it from here is pointless
and even harmful.
It makes much more sense to let the killed task die rather than to
wake up an eternally hungry oom-killer and send him to choose a fatter
victim for breakfast.
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
mm/oom_kill.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;
+ if (fatal_signal_pending(current))
+ return;
+
if (!mutex_trylock(&oom_lock))
return;
out_of_memory(&oc);
--
2.32.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
[not found] ` <cover.1634994605.git.vvs@virtuozzo.com>
2021-10-23 13:19 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin
@ 2021-10-23 13:20 ` Vasily Averin
2021-10-23 15:01 ` Tetsuo Handa
2021-10-25 9:34 ` Michal Hocko
2021-10-23 13:20 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2 siblings, 2 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-23 13:20 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
From: Michal Hocko <mhocko@suse.com>
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for
2 different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.
The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard not to fail
allocations. Anyway, if the OOM killer has already been invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than to allocate.
Moreover if the allocation failed and the OOM killer hasn't been
invoked then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restrictions and
therefore might oom kill a task from a different NUMA domain.
This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.
[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into
a real disaster. This has been broken since kmem accounting has been
introduced for cgroup v1 (3.8). There was no kmem specific reclaim
for the separate limit so the only way to handle kmem hard limit
was to return with ENOMEM.
In upstream the problem will be fixed by removing the outdated kmem limit,
however stable and LTS kernels cannot do it and are still affected.
This patch fixes the problem and should be backported into stable/LTS.]
Cc: stable@vger.kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v2: the fatal_signal_pending() hook has been moved into a separate patch,
improved patch description, removed "Fixed" mark.
---
mm/oom_kill.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1deef8c7a71b..f98954befafb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *oc)
}
/*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
*/
void pagefault_out_of_memory(void)
{
- struct oom_control oc = {
- .zonelist = NULL,
- .nodemask = NULL,
- .memcg = NULL,
- .gfp_mask = 0,
- .order = 0,
- };
+ static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
if (mem_cgroup_oom_synchronize(true))
return;
@@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void)
if (fatal_signal_pending(current))
return;
- if (!mutex_trylock(&oom_lock))
- return;
- out_of_memory(&oc);
- mutex_unlock(&oom_lock);
+ if (__ratelimit(&pfoom_rs))
+ pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
}
SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
--
2.32.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
[not found] ` <cover.1634994605.git.vvs@virtuozzo.com>
2021-10-23 13:19 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin
2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
@ 2021-10-23 13:20 ` Vasily Averin
2021-10-25 9:36 ` Michal Hocko
2 siblings, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-23 13:20 UTC (permalink / raw)
To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This resembles the heuristic for the global OOM situation,
where tasks get access to memory reserves. There is no global memory
shortage at the memcg level, so the memcg heuristic can be more relaxed.
The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time.
It has been observed that it is not really hard to trigger these
bypasses and cause a global OOM situation.
One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.
This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation-wise, a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim a chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.
In addition, this patch renames the should_force_charge() helper
to task_is_dying() because its use is no longer associated with forced
charging.
This patch depends on pagefault_out_of_memory() not triggering
out_of_memory(), because otherwise a memcg charge failure could unwind
to VM_FAULT_OOM and trigger the global OOM killer.
Cc: stable@vger.kernel.org
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
mm/memcontrol.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..87e41c3cac10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
iter != NULL; \
iter = mem_cgroup_iter(NULL, iter, NULL))
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
{
return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* A few threads which were not waiting at mutex_lock_killable() can
* fail to bail out. Therefore, check again after holding oom_lock.
*/
- ret = should_force_charge() || out_of_memory(&oc);
+ ret = task_is_dying() || out_of_memory(&oc);
unlock:
mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
struct page_counter *counter;
enum oom_status oom_status;
unsigned long nr_reclaimed;
+ bool passed_oom = false;
bool may_swap = true;
bool drained = false;
unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_ATOMIC)
goto force;
- /*
- * Unlike in global OOM situations, memcg is not in a physical
- * memory shortage. Allow dying and OOM-killed tasks to
- * bypass the last charges so that they can exit quickly and
- * free their memory.
- */
- if (unlikely(should_force_charge()))
- goto force;
-
/*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (gfp_mask & __GFP_RETRY_MAYFAIL)
goto nomem;
- if (fatal_signal_pending(current))
- goto force;
+ /* Avoid endless loop for tasks bypassed by the oom killer */
+ if (passed_oom && task_is_dying())
+ goto nomem;
/*
* keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
get_order(nr_pages * PAGE_SIZE));
- switch (oom_status) {
- case OOM_SUCCESS:
+ if (oom_status == OOM_SUCCESS) {
+ passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
- case OOM_FAILED:
- goto force;
- default:
- goto nomem;
}
nomem:
if (!(gfp_mask & __GFP_NOFAIL))
--
2.32.0
^ permalink raw reply related [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
@ 2021-10-23 15:01 ` Tetsuo Handa
2021-10-23 19:15 ` Vasily Averin
2021-10-25 8:04 ` Michal Hocko
2021-10-25 9:34 ` Michal Hocko
1 sibling, 2 replies; 66+ messages in thread
From: Tetsuo Handa @ 2021-10-23 15:01 UTC (permalink / raw)
To: Vasily Averin, Michal Hocko
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, cgroups, linux-mm, linux-kernel, kernel,
Johannes Weiner, Vladimir Davydov, Andrew Morton
On 2021/10/23 22:20, Vasily Averin wrote:
> /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
> */
Excuse me for a stupid question. I consider
if (!mutex_trylock(&oom_lock))
return;
out_of_memory(&oc);
mutex_unlock(&oom_lock);
here as the last resort (safeguard) when neither __alloc_pages_may_oom()
nor mem_cgroup_out_of_memory() can make progress. This patch says
let the oom killer triggered from the allocation context care
about the global OOM.
but what if the OOM killer cannot be invoked from the allocation context?
Is there a guarantee that all memory allocations which might result in
VM_FAULT_OOM can invoke the OOM killer?
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
2021-10-23 15:01 ` Tetsuo Handa
@ 2021-10-23 19:15 ` Vasily Averin
2021-10-25 8:04 ` Michal Hocko
1 sibling, 0 replies; 66+ messages in thread
From: Vasily Averin @ 2021-10-23 19:15 UTC (permalink / raw)
To: Tetsuo Handa, Michal Hocko
Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, cgroups, linux-mm, linux-kernel, kernel,
Johannes Weiner, Vladimir Davydov, Andrew Morton
On 23.10.2021 18:01, Tetsuo Handa wrote:
> On 2021/10/23 22:20, Vasily Averin wrote:
>> /*
>> - * The pagefault handler calls here because it is out of memory, so kill a
>> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
>> - * killing is already in progress so do nothing.
>> + * The pagefault handler calls here because some allocation has failed. We have
>> + * to take care of the memcg OOM here because this is the only safe context without
>> + * any locks held but let the oom killer triggered from the allocation context care
>> + * about the global OOM.
>> */
>
> Excuse me for a stupid question. I consider
>
> if (!mutex_trylock(&oom_lock))
> return;
> out_of_memory(&oc);
> mutex_unlock(&oom_lock);
>
> here as the last resort (safeguard) when neither __alloc_pages_may_oom()
> nor mem_cgroup_out_of_memory() can make progress. This patch says
>
> let the oom killer triggered from the allocation context care
> about the global OOM.
>
> but what if the OOM killer cannot be invoked from the allocation context?
> Is there a guarantee that all memory allocations which might result in
> VM_FAULT_OOM can invoke the OOM killer?
I don't think this question is stupid, since I asked it myself :)
You can find this discussion here:
https://lkml.org/lkml/2021/10/21/900
Let me quote it here too
:> 1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory()
:> for example it can be caused by incorrect vm fault operations,
:> (a) which can return this error without calling allocator at all.
:
:I would argue this to be a bug. How can that particular code tell
:whether the system is OOM and the oom killer is a reasonable measure
:to take?
:
:> (b) or which can provide incorrect gfp flags so the allocator can fail without executing out_of_memory.
:
: I am not sure I can see any sensible scenario where pagefault oom killer
: would be an appropriate fix for that.
:
:> (c) This may happen on stable/LTS kernels when an otherwise successful allocation fails by hitting the limit of the legacy memcg-kmem controller.
:> We'll drop it in upstream kernels; however, how should it be handled in old kernels?
:
:Triggering the global oom killer for legacy kmem charge failure is
:clearly wrong. Removing oom killer from #PF would fix that problem.
I would note: (c) is not a theoretical but a real-life problem. In this case the allocation failed without invoking the OOM killer,
yet it is exactly this case where executing out_of_memory() from pagefault_out_of_memory() leads to a disaster.
Thank you,
Vasily Averin
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
2021-10-23 15:01 ` Tetsuo Handa
2021-10-23 19:15 ` Vasily Averin
@ 2021-10-25 8:04 ` Michal Hocko
2021-10-26 13:56 ` Tetsuo Handa
1 sibling, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-25 8:04 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka,
Shakeel Butt, Mel Gorman, cgroups, linux-mm, linux-kernel,
kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton
On Sun 24-10-21 00:01:07, Tetsuo Handa wrote:
> On 2021/10/23 22:20, Vasily Averin wrote:
> > /*
> > - * The pagefault handler calls here because it is out of memory, so kill a
> > - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> > - * killing is already in progress so do nothing.
> > + * The pagefault handler calls here because some allocation has failed. We have
> > + * to take care of the memcg OOM here because this is the only safe context without
> > + * any locks held but let the oom killer triggered from the allocation context care
> > + * about the global OOM.
> > */
>
> Excuse me for a stupid question. I consider
>
> if (!mutex_trylock(&oom_lock))
> return;
> out_of_memory(&oc);
> mutex_unlock(&oom_lock);
>
> here as the last resort (safeguard) when neither __alloc_pages_may_oom()
> nor mem_cgroup_out_of_memory() can make progress. This patch says
>
> let the oom killer triggered from the allocation context care
> about the global OOM.
>
> but what if the OOM killer cannot be invoked from the allocation context?
> Is there a guarantee that all memory allocations which might result in
> VM_FAULT_OOM can invoke the OOM killer?
I do not think there is any guarantee. This code has meant to be a
safeguard but it turns out to be adding more harm than a safety. There
are several scenarios mentioned in this thread where this would be
counter productive or outright wrong thing to do.
On the other hand it is hard to imagine any legitimate situation where
this would be a right thing to do. Maybe you have something more
specific in mind? What would be legitimate code that relies on OOM handling
out of line (where the details about the allocation scope are lost)?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
2021-10-23 13:19 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin
@ 2021-10-25 9:27 ` Michal Hocko
0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-25 9:27 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Sat 23-10-21 16:19:28, Vasily Averin wrote:
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory(), which in its turn
> executes out_of_memory() and can kill a random task.
>
> An allocation might fail when the current task is the oom victim
> and there are no memory reserves left. The OOM killer is already
> handled at the page allocator level for the global OOM and at the
> charging level for the memcg one. Both have much more information
> about the scope of allocation/charge request. This means that
> either the OOM killer has been invoked properly and didn't lead
> to the allocation success or it has been skipped because it couldn't
> have been invoked. In both cases triggering it from here is pointless
> and even harmful.
>
> It makes much more sense to let the killed task die rather than to
> wake up an eternally hungry oom-killer and send him to choose a fatter
> victim for breakfast.
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/oom_kill.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..1deef8c7a71b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
> if (mem_cgroup_oom_synchronize(true))
> return;
>
> + if (fatal_signal_pending(current))
> + return;
> +
> if (!mutex_trylock(&oom_lock))
> return;
> out_of_memory(&oc);
> --
> 2.32.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
2021-10-23 15:01 ` Tetsuo Handa
@ 2021-10-25 9:34 ` Michal Hocko
1 sibling, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-25 9:34 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Sat 23-10-21 16:20:18, Vasily Averin wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
>
> The latter is quite problematic because allocation paths already trigger
> out_of_memory and the page allocator tries really hard not to fail
> allocations. Anyway, if the OOM killer has already been invoked there
> is no reason to invoke it again from the #PF path. Especially when the
> OOM condition might be gone by that time and we have no way to find out
> other than to allocate.
>
> Moreover if the allocation failed and the OOM killer hasn't been
> invoked then we are unlikely to do the right thing from the #PF context
> because we have already lost the allocation context and restrictions and
> therefore might oom kill a task from a different NUMA domain.
>
> This all suggests that there is no legitimate reason to trigger
> out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> that no #PF path returns with VM_FAULT_OOM without allocation print a
> warning that this is happening before we restart the #PF.
>
> [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
> This is a local problem related to memcg, however, it causes unnecessary
> global OOM kills that are repeated over and over again and escalate into
> a real disaster. This has been broken since kmem accounting has been
> introduced for cgroup v1 (3.8). There was no kmem specific reclaim
> for the separate limit so the only way to handle kmem hard limit
> was to return with ENOMEM.
> In upstream the problem will be fixed by removing the outdated kmem limit,
> however stable and LTS kernels cannot do it and are still affected.
> This patch fixes the problem and should be backported into stable/LTS.]
>
> Cc: stable@vger.kernel.org
I would be still careful about backporting to stable trees. At least
wait for a release cycle to catch potential problems before backporting.
The problem with kmem has been documented for quite a long time and we
haven't received a single bug report IIRC, so this is likely not a real
problem people are facing out there.
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
I think this is the right thing to do. Hopefully we won't find some
tricky code which would depend on this behavior. If this turns out to be
the case then we will clearly learn about it from the kernel message and
we can handle that situation more gracefully.
Maybe we will want to update the changelog to be more specific based on
the review comments but this one should describe the problem quite well
already.
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> v2: the fatal_signal_pending() hook has been moved into a separate patch,
> improved patch description, removed "Fixed" mark.
> ---
> mm/oom_kill.c | 22 ++++++++--------------
> 1 file changed, 8 insertions(+), 14 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1deef8c7a71b..f98954befafb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *oc)
> }
>
> /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
> */
> void pagefault_out_of_memory(void)
> {
> - struct oom_control oc = {
> - .zonelist = NULL,
> - .nodemask = NULL,
> - .memcg = NULL,
> - .gfp_mask = 0,
> - .order = 0,
> - };
> + static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
>
> if (mem_cgroup_oom_synchronize(true))
> return;
> @@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void)
> if (fatal_signal_pending(current))
> return;
>
> - if (!mutex_trylock(&oom_lock))
> - return;
> - out_of_memory(&oc);
> - mutex_unlock(&oom_lock);
> + if (__ratelimit(&pfoom_rs))
> + pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
> }
>
> SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> --
> 2.32.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-23 13:20 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
@ 2021-10-25 9:36 ` Michal Hocko
2021-10-27 22:36 ` Andrew Morton
0 siblings, 1 reply; 66+ messages in thread
From: Michal Hocko @ 2021-10-25 9:36 UTC (permalink / raw)
To: Vasily Averin
Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel
On Sat 23-10-21 16:20:51, Vasily Averin wrote:
> Memory cgroup charging allows killed or exiting tasks to exceed the hard
> limit. It is assumed that the amount of the memory charged by those
> tasks is bound and most of the memory will get released while the task
> is exiting. This resembles the heuristic for the global OOM situation,
> where tasks get access to memory reserves. There is no global memory
> shortage at the memcg level, so the memcg heuristic can be more relaxed.
>
> The above assumption is overly optimistic though. E.g. vmalloc can scale
> to really large requests and the heuristic would allow that. We used to
> have an early break in the vmalloc allocator for killed tasks but this
> has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed""). There are likely other similar code paths
> which do not check for fatal signals in an allocation&charge loop.
> Also there are some kernel objects charged to a memcg which are not
> bound to a process life time.
>
> It has been observed that it is not really hard to trigger these
> bypasses and cause a global OOM situation.
>
> One potential way to address these runaways would be to limit the amount
> of excess (similar to the global OOM with limited oom reserves). This is
> certainly possible but it is not really clear how much of an excess is
> desirable and still protects from global OOMs as that would have to
> consider the overall memcg configuration.
>
> This patch is addressing the problem by removing the heuristic
> altogether. Bypass is only allowed for requests which either cannot fail
> or where the failure is not desirable while excess should be still
> limited (e.g. atomic requests). Implementation-wise, a killed or dying
> task fails to charge if it has passed the OOM killer stage. That should
> give all forms of reclaim a chance to restore the limit before the
> failure (ENOMEM) and tell the caller to back off.
>
> In addition, this patch renames the should_force_charge() helper
> to task_is_dying() because its use is no longer associated with forced
> charging.
>
> This patch depends on pagefault_out_of_memory() not triggering
> out_of_memory(), because otherwise a memcg charge failure could unwind
> to VM_FAULT_OOM and trigger the global OOM killer.
>
> Cc: stable@vger.kernel.org
My view on stable backport is similar to the previous patch. If we want
to have it there then let's wait for some time to see whether there are
any fallouts as this patch depends on the PF_OOM change.
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/memcontrol.c | 27 ++++++++-------------------
> 1 file changed, 8 insertions(+), 19 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..87e41c3cac10 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
> iter != NULL; \
> iter = mem_cgroup_iter(NULL, iter, NULL))
>
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
> {
> return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> (current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> * A few threads which were not waiting at mutex_lock_killable() can
> * fail to bail out. Therefore, check again after holding oom_lock.
> */
> - ret = should_force_charge() || out_of_memory(&oc);
> + ret = task_is_dying() || out_of_memory(&oc);
>
> unlock:
> mutex_unlock(&oom_lock);
> @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> struct page_counter *counter;
> enum oom_status oom_status;
> unsigned long nr_reclaimed;
> + bool passed_oom = false;
> bool may_swap = true;
> bool drained = false;
> unsigned long pflags;
> @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (gfp_mask & __GFP_ATOMIC)
> goto force;
>
> - /*
> - * Unlike in global OOM situations, memcg is not in a physical
> - * memory shortage. Allow dying and OOM-killed tasks to
> - * bypass the last charges so that they can exit quickly and
> - * free their memory.
> - */
> - if (unlikely(should_force_charge()))
> - goto force;
> -
> /*
> * Prevent unbounded recursion when reclaim operations need to
> * allocate memory. This might exceed the limits temporarily,
> @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> if (gfp_mask & __GFP_RETRY_MAYFAIL)
> goto nomem;
>
> - if (fatal_signal_pending(current))
> - goto force;
> + /* Avoid endless loop for tasks bypassed by the oom killer */
> + if (passed_oom && task_is_dying())
> + goto nomem;
>
> /*
> * keep retrying as long as the memcg oom killer is able to make
> @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
> get_order(nr_pages * PAGE_SIZE));
> - switch (oom_status) {
> - case OOM_SUCCESS:
> + if (oom_status == OOM_SUCCESS) {
> + passed_oom = true;
> nr_retries = MAX_RECLAIM_RETRIES;
> goto retry;
> - case OOM_FAILED:
> - goto force;
> - default:
> - goto nomem;
> }
> nomem:
> if (!(gfp_mask & __GFP_NOFAIL))
> --
> 2.32.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
2021-10-25 8:04 ` Michal Hocko
@ 2021-10-26 13:56 ` Tetsuo Handa
2021-10-26 14:07 ` Michal Hocko
0 siblings, 1 reply; 66+ messages in thread
From: Tetsuo Handa @ 2021-10-26 13:56 UTC (permalink / raw)
To: Michal Hocko
Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka,
Shakeel Butt, Mel Gorman, cgroups, linux-mm, linux-kernel,
kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton
On 2021/10/25 17:04, Michal Hocko wrote:
> I do not think there is any guarantee. This code was meant to be a
> safeguard, but it turns out to add more harm than safety. There are
> several scenarios mentioned in this thread where this would be
> counterproductive or the outright wrong thing to do.
Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) + hitting legacy kmem
charge limit might be an unexpected combination?
>
> On the other hand, it is hard to imagine any legitimate situation where
> this would be the right thing to do. Maybe you have something more
> specific in mind? What would be the legit code relying on OOM handling
> out of line (where the details about the allocation scope are lost)?
I don't have a specific scenario, but I feel that this might be a chance to
retry a killable vmalloc(). Commit b8c8a338f75e ("Revert "vmalloc: back off
when the current task is killed"") was 4.5 years ago, and fuzz testing has found
many bugs triggered by memory allocation fault injection. Thus, I think that
the direction is going towards "we can fail memory allocation upon SIGKILL
(rather than worrying about depleting memory reserves and/or escalating to
global OOM killer invocations)". Most memory allocation requests which
allocate memory for a userspace process are willing to give up upon SIGKILL.
Like you are trying to add NOFS, NOIO, and NOFAIL support to vmalloc(), you could
consider KILLABLE support as well. Of course, direct reclaim makes it difficult
to immediately give up upon SIGKILL, but killable allocation still sounds nice,
even on a best-effort basis.
* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
2021-10-26 13:56 ` Tetsuo Handa
@ 2021-10-26 14:07 ` Michal Hocko
0 siblings, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-26 14:07 UTC (permalink / raw)
To: Tetsuo Handa
Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka,
Shakeel Butt, Mel Gorman, cgroups, linux-mm, linux-kernel,
kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton
On Tue 26-10-21 22:56:44, Tetsuo Handa wrote:
> On 2021/10/25 17:04, Michal Hocko wrote:
> > I do not think there is any guarantee. This code was meant to be a
> > safeguard, but it turns out to add more harm than safety. There are
> > several scenarios mentioned in this thread where this would be
> > counterproductive or the outright wrong thing to do.
>
> Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) + hitting legacy kmem
> charge limit might be an unexpected combination?
I am not sure I follow, or why PR_SET_IO_FLUSHER should be relevant. But
triggering the global OOM killer on a kmem charge limit failure is
certainly not the right thing to do. Quite the opposite, because this would
effectively be a global DoS as a result of a local memory constraint.
> > On the other hand, it is hard to imagine any legitimate situation where
> > this would be the right thing to do. Maybe you have something more
> > specific in mind? What would be the legit code relying on OOM handling
> > out of line (where the details about the allocation scope are lost)?
>
> I don't have a specific scenario, but I feel that this might be a chance to
> retry a killable vmalloc(). Commit b8c8a338f75e ("Revert "vmalloc: back off
> when the current task is killed"") was 4.5 years ago, and fuzz testing has found
> many bugs triggered by memory allocation fault injection. Thus, I think that
> the direction is going towards "we can fail memory allocation upon SIGKILL
> (rather than worrying about depleting memory reserves and/or escalating to
> global OOM killer invocations)". Most memory allocation requests which
> allocate memory for a userspace process are willing to give up upon SIGKILL.
>
> Like you are trying to add NOFS, NOIO, and NOFAIL support to vmalloc(), you could
> consider KILLABLE support as well. Of course, direct reclaim makes it difficult
> to immediately give up upon SIGKILL, but killable allocation still sounds nice,
> even on a best-effort basis.
This is all fine, but I am not sure how this is related to this patch.
The previous patch already gives up in pagefault_out_of_memory on a
pending fatal signal. So this code is not really reachable.
Also, allowing more allocations to fail doesn't really suggest that we
should trigger the OOM killer from the #PF. I would argue that the opposite
is actually the case. Or have I just not understood your concern?
--
Michal Hocko
SUSE Labs
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-25 9:36 ` Michal Hocko
@ 2021-10-27 22:36 ` Andrew Morton
2021-10-28 7:22 ` Vasily Averin
2021-10-29 7:58 ` Michal Hocko
0 siblings, 2 replies; 66+ messages in thread
From: Andrew Morton @ 2021-10-27 22:36 UTC (permalink / raw)
To: Michal Hocko
Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel,
Greg Kroah-Hartman
On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:
> My view on stable backport is similar to the previous patch. If we want
> to have it there then let's wait for some time to see whether there are
> any fallouts as this patch depends on the PF_OOM change.
It's strange that [1/3] has cc:stable, but [2/3] and [3/3] do
not. What is the thinking here?
I expect we'd be OK with merging these into 5.16-rc1. This still gives
another couple of months in -rc to shake out any problems. But I
suspect the -stable maintainers will merge and release the patches
before they are released in 5.16.
In which case an alternative would be not to mark these patches
cc:stable and to somehow remember to ask the -stable maintainers to
merge them after 5.16 has been on the streets for a suitable period.
Greg, thoughts?
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-27 22:36 ` Andrew Morton
@ 2021-10-28 7:22 ` Vasily Averin
2021-10-29 7:46 ` Greg Kroah-Hartman
2021-10-29 7:58 ` Michal Hocko
1 sibling, 1 reply; 66+ messages in thread
From: Vasily Averin @ 2021-10-28 7:22 UTC (permalink / raw)
To: Andrew Morton, Michal Hocko
Cc: Johannes Weiner, Vladimir Davydov, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel,
Greg Kroah-Hartman
On 28.10.2021 01:36, Andrew Morton wrote:
> On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:
>
>> My view on stable backport is similar to the previous patch. If we want
>> to have it there then let's wait for some time to see whether there are
>> any fallouts as this patch depends on the PF_OOM change.
>
> It's strange that [1/3] has cc:stable, but [2/3] and [3/3] do
> not. What is the thinking here?
My fault, I missed it.
All 3 patches should be backported;
I have already backported them to stable kernels since 4.4 and I'm ready to submit them on demand.
> I expect we'd be OK with merging these into 5.16-rc1. This still gives
> another couple of months in -rc to shake out any problems. But I
> suspect the -stable maintainers will merge and release the patches
> before they are released in 5.16.
>
> In which case an alternative would be not to mark these patches
> cc:stable and to somehow remember to ask the -stable maintainers to
> merge them after 5.16 has been on the streets for a suitable period.
>
> Greg, thoughts?
If you wish I can remind Greg in a month or even after 5.17 release.
Thank you,
Vasily Averin
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-28 7:22 ` Vasily Averin
@ 2021-10-29 7:46 ` Greg Kroah-Hartman
0 siblings, 0 replies; 66+ messages in thread
From: Greg Kroah-Hartman @ 2021-10-29 7:46 UTC (permalink / raw)
To: Vasily Averin
Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Vladimir Davydov,
Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
kernel
On Thu, Oct 28, 2021 at 10:22:56AM +0300, Vasily Averin wrote:
> On 28.10.2021 01:36, Andrew Morton wrote:
> > On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:
> >
> >> My view on stable backport is similar to the previous patch. If we want
> >> to have it there then let's wait for some time to see whether there are
> >> any fallouts as this patch depends on the PF_OOM change.
> >
> > It's strange that [1/3] has cc:stable, but [2/3] and [3/3] do
> > not. What is the thinking here?
>
> My fault, I missed it.
> All 3 patches should be backported;
> I have already backported them to stable kernels since 4.4 and I'm ready to submit them on demand.
>
> > I expect we'd be OK with merging these into 5.16-rc1. This still gives
> > another couple of months in -rc to shake out any problems. But I
> > suspect the -stable maintainers will merge and release the patches
> > before they are released in 5.16.
> >
> > In which case an alternative would be not to mark these patches
> > cc:stable and to somehow remember to ask the -stable maintainers to
> > merge them after 5.16 has been on the streets for a suitable period.
> >
> > Greg, thoughts?
>
> If you wish I can remind Greg in a month or even after 5.17 release.
Please remind us then, otherwise I will not remember :)
thanks,
greg k-h
* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
2021-10-27 22:36 ` Andrew Morton
2021-10-28 7:22 ` Vasily Averin
@ 2021-10-29 7:58 ` Michal Hocko
1 sibling, 0 replies; 66+ messages in thread
From: Michal Hocko @ 2021-10-29 7:58 UTC (permalink / raw)
To: Andrew Morton
Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Roman Gushchin,
Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel,
Greg Kroah-Hartman
On Wed 27-10-21 15:36:08, Andrew Morton wrote:
> On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:
>
> > My view on stable backport is similar to the previous patch. If we want
> > to have it there then let's wait for some time to see whether there are
> > any fallouts as this patch depends on the PF_OOM change.
>
> It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do
> not. What is the thinking here?
>
> I expect we'd be OK with merging these into 5.16-rc1. This still gives
> another couple of months in -rc to shake out any problems. But I
> suspect the -stable maintainers will merge and release the patches
> before they are released in 5.16.
>
> In which case an alternative would be not to mark these patches
> cc:stable and to somehow remember to ask the -stable maintainers to
> merge them after 5.16 has been on the streets for a suitable period.
My take on stable backports is http://lkml.kernel.org/r/YXZ6FMzJLEz4TA2d@dhcp22.suse.cz
--
Michal Hocko
SUSE Labs
end of thread, other threads:[~2021-10-29 7:59 UTC | newest]
Thread overview: 66+ messages
-- links below jump to the message on this page --
2021-10-18 8:13 [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
2021-10-18 9:04 ` Michal Hocko
2021-10-18 10:05 ` Vasily Averin
2021-10-18 10:12 ` Vasily Averin
2021-10-18 11:53 ` Michal Hocko
[not found] ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>
2021-10-18 12:27 ` Michal Hocko
2021-10-18 15:07 ` Shakeel Butt
2021-10-18 16:51 ` Michal Hocko
2021-10-18 17:13 ` Shakeel Butt
2021-10-18 18:52 ` Vasily Averin
2021-10-18 19:18 ` Vasily Averin
2021-10-19 5:34 ` Shakeel Butt
2021-10-19 5:33 ` Shakeel Butt
2021-10-19 6:42 ` Vasily Averin
2021-10-19 8:47 ` Michal Hocko
2021-10-19 6:30 ` Vasily Averin
2021-10-19 8:49 ` Michal Hocko
2021-10-19 10:30 ` Vasily Averin
2021-10-19 11:54 ` Michal Hocko
2021-10-19 12:04 ` Michal Hocko
2021-10-19 13:26 ` Vasily Averin
2021-10-19 14:13 ` Michal Hocko
2021-10-19 14:19 ` Michal Hocko
2021-10-19 19:09 ` Vasily Averin
2021-10-20 8:07 ` [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-10-20 8:43 ` Michal Hocko
2021-10-20 12:11 ` [PATCH memcg RFC 0/3] " Vasily Averin
[not found] ` <cover.1634730787.git.vvs@virtuozzo.com>
2021-10-20 12:12 ` [PATCH memcg 1/3] mm: do not force global OOM from inside " Vasily Averin
2021-10-20 12:33 ` Michal Hocko
2021-10-20 13:52 ` Vasily Averin
2021-10-20 12:13 ` [PATCH memcg 2/3] memcg: remove charge forcing for " Vasily Averin
2021-10-20 12:41 ` Michal Hocko
2021-10-20 14:21 ` Vasily Averin
2021-10-20 14:57 ` Michal Hocko
2021-10-20 15:20 ` Tetsuo Handa
2021-10-21 10:03 ` Michal Hocko
2021-10-20 12:14 ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
2021-10-20 13:02 ` Michal Hocko
2021-10-20 15:46 ` Vasily Averin
2021-10-21 11:49 ` Michal Hocko
2021-10-21 15:05 ` Vasily Averin
2021-10-21 16:47 ` Michal Hocko
2021-10-22 8:10 ` [PATCH memcg v2 0/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
[not found] ` <cover.1634889066.git.vvs@virtuozzo.com>
2021-10-22 8:11 ` [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
2021-10-22 8:55 ` Michal Hocko
2021-10-22 8:11 ` [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-10-22 9:10 ` Michal Hocko
2021-10-23 13:18 ` [PATCH memcg v3 0/3] " Vasily Averin
[not found] ` <cover.1634994605.git.vvs@virtuozzo.com>
2021-10-23 13:19 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin
2021-10-25 9:27 ` Michal Hocko
2021-10-23 13:20 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
2021-10-23 15:01 ` Tetsuo Handa
2021-10-23 19:15 ` Vasily Averin
2021-10-25 8:04 ` Michal Hocko
2021-10-26 13:56 ` Tetsuo Handa
2021-10-26 14:07 ` Michal Hocko
2021-10-25 9:34 ` Michal Hocko
2021-10-23 13:20 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-10-25 9:36 ` Michal Hocko
2021-10-27 22:36 ` Andrew Morton
2021-10-28 7:22 ` Vasily Averin
2021-10-29 7:46 ` Greg Kroah-Hartman
2021-10-29 7:58 ` Michal Hocko
2021-10-21 8:03 ` [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
2021-10-21 11:49 ` Michal Hocko
2021-10-21 13:24 ` Vasily Averin