* [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18  8:13 ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-18  8:13 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, cgroups, linux-mm, linux-kernel, kernel

[-- Attachment #1: Type: text/plain, Size: 2770 bytes --]

While checking the patches that fixed broken memcg accounting in vmalloc, I found
another issue: a false global OOM triggered by a memcg-limited user space task.

I ran a vmalloc-eater in a loop inside a memcg-limited LXC container and verified
that it did not consume host memory beyond the assigned limit, triggered memcg OOM,
and generated "Memory cgroup out of memory" messages.  Everything was as expected.

However, I was surprised to also find occasional global OOM messages.
I set sysctl vm.panic_on_oom to 1, repeated the test, and crashed the node.

Dmesg showed that a global OOM was detected on a 16 GB node with ~10 GB of free memory.

 syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
 CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
 Call Trace:
  dump_stack_lvl+0x57/0x72
  dump_header+0x4a/0x2c1
  out_of_memory.cold+0xa/0x7e
  pagefault_out_of_memory+0x46/0x60
  exc_page_fault+0x79/0x2b0
  asm_exc_page_fault+0x1e/0x30
...
 Mem-Info:
 Node 0 DMA: 0*4kB 0*8kB <...> = 13296kB
 Node 0 DMA32: 705*4kB (UM) <...> = 2586964kB
 Node 0 Normal: 2743*4kB (UME) <...> = 6904828kB
...
 4095866 pages RAM
...
 Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

The full dmesg can be found in the attached file.

How could this happen?

A user-space task inside the memcg-limited container generated a page fault.
Its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
allocate the page because the memcg limit was exceeded and so returned VM_FAULT_OOM.
Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
out_of_memory() without any memcg set, i.e. as a global OOM.

This problem partially depends on one of my recent patches, which disabled
unlimited memory allocation for dying tasks. However, I think the problem can
also happen for non-killed tasks, for example because of the kmem limit.

At present do_user_addr_fault() does not know why the page allocation failed,
i.e. whether it was a global or a memcg OOM. I propose to save this information
in a new flag on task_struct. It can be set on memcg restrictions in
try_charge_memcg() (for the memory controller) and in obj_cgroup_charge_pages()
(for the kmem controller).  Then it can be used in mem_cgroup_oom_synchronize(),
called inside pagefault_out_of_memory():
in case of memcg-related restrictions it will not trigger a fake global OOM but
will return to user space, which will retry the fault or kill the process if it
got a fatal signal.

Thank you,
	Vasily Averin

Vasily Averin (1):
  memcg: prevent false global OOM triggered by memcg-limited task.

 include/linux/sched.h |  1 +
 mm/memcontrol.c       | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

-- 
2.32.0


[-- Attachment #2: dmesg-oom.txt --]
[-- Type: text/plain, Size: 25650 bytes --]

[59622.176098] syz-executor invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=1000
[59622.178633] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.180840] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.182562] Call Trace:
[59622.183525]  dump_stack_lvl+0x57/0x72
[59622.184782]  dump_header+0x4a/0x2c1
[59622.185929]  oom_kill_process.cold+0xb/0x10
[59622.187203]  out_of_memory+0x229/0x5b0
[59622.188399]  mem_cgroup_out_of_memory+0x111/0x130
[59622.189773]  try_charge_memcg+0x693/0x720
[59622.191013]  ? kvm_sched_clock_read+0x14/0x30
[59622.192318]  charge_memcg+0x57/0x170
[59622.193482]  mem_cgroup_swapin_charge_page+0x99/0x1d0
[59622.194932]  do_swap_page+0x916/0xbf0
[59622.196110]  ? __lock_acquire+0x3b3/0x1e00
[59622.197377]  __handle_mm_fault+0xa5f/0x14e0
[59622.198649]  ? lock_acquire+0xc4/0x2e0
[59622.199847]  handle_mm_fault+0x149/0x3f0
[59622.201071]  do_user_addr_fault+0x1f4/0x6c0
[59622.202346]  exc_page_fault+0x79/0x2b0
[59622.203545]  ? asm_exc_page_fault+0x8/0x30
[59622.204802]  asm_exc_page_fault+0x1e/0x30
[59622.206043] RIP: 0033:0x41a948
[59622.207115] Code: 64 83 0c 25 08 03 00 00 10 64 48 8b 3c 25 00 03 00 00 e8 8b fe ff ff f4 66 2e 0f 1f 84 00 00 00 00 00 f7 c7 02 00 00 00 75 27 <64> 8b 04 25 08 03 00 00 41 89 c3 41 83 e3 fd f0 64 44 0f b1 1c 25
[59622.211557] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.213058] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.214919] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.216772] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.218623] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.220467] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.222367] memory: usage 524268kB, limit 524288kB, failcnt 28608
[59622.224112] memory+swap: usage 554776kB, limit 9007199254740988kB, failcnt 0
[59622.225982] kmem: usage 524168kB, limit 9007199254740988kB, failcnt 0
[59622.227740] Memory cgroup stats for /lxc.payload.test:
[59622.234613] anon 0
file 122880
kernel_stack 1294336
pagetables 3665920
percpu 490896
sock 0
shmem 0
file_mapped 12288
file_dirty 0
file_writeback 0
swapcached 49008640
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 0
active_anon 0
inactive_file 110592
active_file 0
unevictable 0
slab_reclaimable 1513744
slab_unreclaimable 14117008
slab 15630752
workingset_refault_anon 3800
workingset_refault_file 24974
workingset_activate_anon 923
workingset_activate_file 414
workingset_restore_anon 422
workingset_restore_file 190
workingset_nodereclaim 74
pgfault 120040
pgmajfault 4766
pgrefill 15999
pgscan 486263
pgsteal 44646
pgactivate 13066
pgdeactivate 14277
pglazyfree 60
pglazyfreed 34
thp_fault_alloc 0
thp_collapse_alloc 0
[59622.265206] Tasks state (memory values in pages):
[59622.266454] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[59622.268394] [  14696]     0 14696    56226      690    90112      145             0 bash
[59622.270246] [  14746]     0 14746   262198        0   266240     2900             0 syz-execprog
[59622.272217] [  14764]     0 14764    12552        0    36864       16             0 syz-executor
[59622.274217] [  14765]     0 14765    12552        0    36864       16             0 syz-executor
[59622.276165] [  14769]     0 14769    12552        0    36864       16             0 syz-executor
[59622.278091] [  14772]     0 14772    12552        0    36864       17             0 syz-executor
[59622.280023] [  14775]     0 14775    12552        0    36864       17             0 syz-executor
[59622.281949] [  14777]     0 14777    12552        0    36864       16             0 syz-executor
[59622.283847] [  14781]     0 14781    12552        0    36864       16             0 syz-executor
[59622.285737] [  14783]     0 14783    12552        0    36864       17             0 syz-executor
[59622.287673] [  14787]     0 14787    12552        0    36864       16             0 syz-executor
[59622.287968] systemd-journald[648]: Compressed data object 720 -> 393 using ZSTD
[59622.289564] [  14800]     0 14800    12552        0    36864       16             0 syz-executor
[59622.293144] [  14803]     0 14803    12552        0    36864       16             0 syz-executor
[59622.295054] [  14805]     0 14805    12552        0    36864       16             0 syz-executor
[59622.296928] [  14812]     0 14812    12552        0    36864       17             0 syz-executor
[59622.298796] [  14814]     0 14814    12552        0    36864       16             0 syz-executor
[59622.300663] [  14817]     0 14817    12552        0    36864       17             0 syz-executor
[59622.302528] [  14820]     0 14820    12552        0    36864       16             0 syz-executor
[59622.304399] [  14779]     0 14779    12551        0    49152       37             0 syz-executor
[59622.306271] [  15280]     0 15279    12584        0    69632        0          1000 syz-executor
[59622.308144] [  14767]     0 14767    12551        0    49152       41             0 syz-executor
[59622.310046] [  15282]     0 15281    12584        0    69632        0          1000 syz-executor
[59622.311926] [  14766]     0 14766    12551        0    49152       44             0 syz-executor
[59622.313812] [  15286]     0 15285    12584        0    65536        0          1000 syz-executor
[59622.315704] [  14782]     0 14782    12551        0    49152       35             0 syz-executor
[59622.317621] [  15284]     0 15283    12584        0    65536        0          1000 syz-executor
[59622.319570] [  14784]     0 14784    12551        0    49152       36             0 syz-executor
[59622.321496] [  15288]     0 15287    12617        0    65536        0          1000 syz-executor
[59622.323454] [  14776]     0 14776    12551        0    49152       34             0 syz-executor
[59622.325374] [  15291]     0 15289    12584        0    69632        4          1000 syz-executor
[59622.327293] [  14813]     0 14813    12551        0    49152       33             0 syz-executor
[59622.329216] [  15295]     0 15294    12584        0    65536        0          1000 syz-executor
[59622.331139] [  14819]     0 14819    12551        0    49152       32             0 syz-executor
[59622.333068] [  15293]     0 15292    12584        0    65536        0          1000 syz-executor
[59622.334991] [  14821]     0 14821    12551        0    49152       31             0 syz-executor
[59622.336921] [  15297]     0 15296    12584        0    65536        0          1000 syz-executor
[59622.338875] [  14831]     0 14831    12551        0    49152       42             0 syz-executor
[59622.340811] [  15299]     0 15298    12617        0    65536        0          1000 syz-executor
[59622.342752] [  14804]     0 14804    12551        0    49152       38             0 syz-executor
[59622.344700] [  15301]     0 15300    12584        0    69632        0          1000 syz-executor
[59622.346649] [  14816]     0 14816    12551        0    49152       27             0 syz-executor
[59622.348604] [  15306]     0 15305    12584        0    69632        0          1000 syz-executor
[59622.350615] [  14774]     0 14774    12551        0    49152       30             0 syz-executor
[59622.352606] [  15304]     0 15303    12584        0    65536        0          1000 syz-executor
[59622.354575] [  14807]     0 14807    12551        0    49152       25             0 syz-executor
[59622.356531] [  15307]     0 15307    12584        0    69632       37          1000 syz-executor
[59622.358481] [  14801]     0 14801    12551        0    49152       23             0 syz-executor
[59622.360418] [  15261]     0 15260    12618        0    65536       31          1000 syz-executor
[59622.362355] [  14788]     0 14788    12551        0    49152       24             0 syz-executor
[59622.364291] [  15263]     0 15262    12618        0    69632        0          1000 syz-executor
[59622.366269] [  14503]     0 14503    25205     1712   241664      235             0 systemd-journal
[59622.368288] [  14736]     0 14736     1637      398    61440       32             0 agetty
[59622.370161] [  14737]     0 14737     1637      383    61440       31             0 agetty
[59622.372037] [  14743]     0 14743     1637      398    61440       31             0 agetty
[59622.373939] [  14744]     0 14744     1637      387    57344       31             0 agetty
[59622.375849] [  14512]     0 14512    22648     1593   221184      233             0 systemd-logind
[59622.377858] [  14513]    81 14513    13516      801   155648      138          -900 dbus-daemon
[59622.379867] [  14734]     0 14734    52899      856   180224      440             0 rsyslogd
[59622.381813] [  14735]     0 14735     1637      343    61440       31             0 agetty
[59622.383701] [  14741]     0 14741     5724      572    94208      239             0 crond
[59622.385568] [  14086]     0 14086    25345     2120   233472      520             0 systemd
[59622.387470] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=lxc.payload.test,mems_allowed=0,oom_memcg=/lxc.payload.test,task_memcg=/lxc.payload.test,task=syz-executor,pid=15307,uid=0
[59622.391402] Memory cgroup out of memory: Killed process 15307 (syz-executor) total-vm:50336kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:68kB oom_score_adj:1000
[59622.395544] oom_reaper: reaped process 15307 (syz-executor), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[59622.411181] syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
[59622.414272] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.416247] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.417950] Call Trace:
[59622.418901]  dump_stack_lvl+0x57/0x72
[59622.420067]  dump_header+0x4a/0x2c1
[59622.421200]  out_of_memory.cold+0xa/0x7e
[59622.422415]  pagefault_out_of_memory+0x46/0x60
[59622.423784]  exc_page_fault+0x79/0x2b0
[59622.424966]  ? asm_exc_page_fault+0x8/0x30
[59622.426207]  asm_exc_page_fault+0x1e/0x30
[59622.427426] RIP: 0033:0x41a948
[59622.428485] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.430064] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.431513] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.433299] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.435090] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.436875] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.438691] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.440547] Mem-Info:
[59622.441516] active_anon:279 inactive_anon:24996 isolated_anon:0
 active_file:395743 inactive_file:848017 isolated_file:0
 unevictable:0 dirty:159 writeback:21
 slab_reclaimable:40993 slab_unreclaimable:45236
 mapped:66928 shmem:286 pagetables:2477 bounce:0
 kernel_misc_reclaimable:0
 free:2376272 free_pcp:7579 free_cma:0
[59622.451730] Node 0 active_anon:1116kB inactive_anon:99984kB active_file:1582972kB inactive_file:3392068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:267712kB dirty:832kB writeback:84kB shmem:1144kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:5488kB pagetables:9908kB all_unreclaimable? no
[59622.458826] Node 0 DMA free:13296kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[59622.465131] lowmem_reserve[]: 0 2706 15265 15265 15265
[59622.466746] Node 0 DMA32 free:2586964kB min:11968kB low:14960kB high:17952kB reserved_highatomic:0KB active_anon:0kB inactive_anon:20kB active_file:27084kB inactive_file:172344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:6120kB local_pcp:1004kB free_cma:0kB
[59622.473718] lowmem_reserve[]: 0 0 12559 12559 12559
[59622.475311] Node 0 Normal free:6904828kB min:55544kB low:69428kB high:83312kB reserved_highatomic:0KB active_anon:1116kB inactive_anon:99964kB active_file:1555888kB inactive_file:3219724kB unevictable:0kB writepending:916kB present:13238272kB managed:12871060kB mlocked:0kB bounce:0kB free_pcp:24068kB local_pcp:692kB free_cma:0kB
[59622.482647] lowmem_reserve[]: 0 0 0 0 0
[59622.484096] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 13296kB
[59622.488044] Node 0 DMA32: 705*4kB (UM) 624*8kB (UME) 497*16kB (UME) 594*32kB (UME) 532*64kB (UME) 49*128kB (UME) 36*256kB (UE) 36*512kB (UE) 20*1024kB (U) 17*2048kB (UE) 593*4096kB (U) = 2586964kB
[59622.492687] Node 0 Normal: 2743*4kB (UME) 2084*8kB (UME) 536*16kB (UME) 1496*32kB (UME) 1148*64kB (ME) 171*128kB (UME) 115*256kB (M) 94*512kB (M) 98*1024kB (UME) 163*2048kB (UME) 1517*4096kB (UM) = 6904828kB
[59622.497553] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[59622.499898] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[59622.502197] 1244068 total pagecache pages
[59622.503688] 4 pages in swap cache
[59622.505046] Swap cache stats: add 54921, delete 54922, find 137/9740
[59622.506956] Free swap  = 8357116kB
[59622.508352] Total swap = 8388604kB
[59622.510564] 4095866 pages RAM
[59622.512379] 0 pages HighMem/MovableOnly
[59622.513853] 173881 pages reserved
[59622.515220] 0 pages cma reserved
[59622.516581] 0 pages hwpoisoned
[59622.517897] Tasks state (memory values in pages):
[59622.519524] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[59622.521895] [    648]     0   648    71491    45126   552960        0          -250 systemd-journal
[59622.524550] [    666]     0   666    11976     3110   102400        0         -1000 systemd-udevd
[59622.526944] [    797]   193   797    10732     5087   122880        0             0 systemd-resolve
[59622.529352] [    798]     0   798    26617      605    69632        0         -1000 auditd
[59622.531635] [    800]     0   800     2077     1056    49152        0             0 sedispatch
[59622.533963] [    821]     0   821    19874      761    57344        0             0 irqbalance
[59622.536297] [    823]     0   823     2921      533    45056        0             0 mcelog
[59622.538590] [    824]   996   824   666787     5873   217088        0             0 polkitd
[59622.540882] [    825]     0   825    20273      951    53248        0             0 qemu-ga
[59622.543158] [    826]     0   826   205022    10950  1179648        0             0 rsyslogd
[59622.545449] [    827]     0   827    12186     2965   102400        0             0 sssd
[59622.547678] [    828]     0   828     4502     2175    73728        0             0 systemd-homed
[59622.550047] [    829]     0   829     4466     2118    69632        0             0 systemd-machine
[59622.552466] [    830]    81   830     2530     1083    69632        0          -900 dbus-broker-lau
[59622.554852] [    835]     0   835    68167     3994   139264        0             0 abrtd
[59622.557060] [    839]   982   839    23747      967    69632        0             0 chronyd
[59622.559289] [    840]    81   840     1525      919    49152        0          -900 dbus-broker
[59622.561576] [    843]     0   843    12718     3366   110592        0             0 sssd_be
[59622.563803] [    847]     0   847   177006     5860   655360        0             0 abrt-dump-journ
[59622.566207] [    849]     0   849   174956     6978   827392        0             0 abrt-dump-journ
[59622.568518] [    850]     0   850    15570     9663   143360        0             0 sssd_nss
[59622.570712] [    857]     0   857     4966     2724    81920        0             0 systemd-logind
[59622.572980] [    860]     0   860    79033     2793   118784        0             0 ModemManager
[59622.575247] [    861]     0   861    34022    10210   155648        0             0 firewalld
[59622.577428] [    867]     0   867    66866     5087   147456        0             0 NetworkManager
[59622.579690] [    881]     0   881     7636     2072    73728        0         -1000 sshd
[59622.581790] [    887]     0   887    13477      801    77824        0             0 gssproxy
[59622.583916] [    941]     0   941     5246      715    61440        0             0 atd
[59622.585957] [    943]     0   943     4474      892    65536        0             0 crond
[59622.588007] [    961]     0   961     2408      455    49152        0             0 agetty
[59622.590051] [    962]     0   962     3050      458    57344        0             0 agetty
[59622.592069] [   1018]   991  1018     6645      533    65536        0             0 dnsmasq
[59622.594114] [   1060]     0  1060    11399     2828    90112        0             0 sshd
[59622.596075] [   1459]     0  1459     4387     1924    73728        0             0 systemd-userdbd
[59622.598197] [   1488]  1000  1488     7862     3570    90112        0             0 systemd
[59622.600204] [   1490]  1000  1490    12058     1458   102400        0             0 (sd-pam)
[59622.602218] [   1555]  1000  1555    11399     1367    81920        0             0 sshd
[59622.604170] [   1561]  1000  1561     5188     1443    61440        0             0 bash
[59622.606168] [   2078]   991  2078     6645      532    61440        0             0 dnsmasq
[59622.608154] [   2080]     0  2080     6638      100    61440        0             0 dnsmasq
[59622.610161] [   6747]  1000  6747    10102     2101    94208        0             0 su
[59622.612074] [   6751]     0  6751     4504     1424    65536        0             0 bash
[59622.614022] [  14077]     0 14077     4470     1927    73728        0             0 systemd-userwor
[59622.616135] [  14078]     0 14078     4470     1908    77824        0             0 systemd-userwor
[59622.618252] [  14080]     0 14080     4470     1929    69632        0             0 systemd-userwor
[59622.620358] [  14085]     0 14085     2069      642    57344        0             0 lxc-start
[59622.622379] [  14086]     0 14086    25345     2120   233472      520             0 systemd
[59622.624401] [  14503]     0 14503    25205     1712   241664      235             0 systemd-journal
[59622.626610] [  14512]     0 14512    22648     1593   221184      233             0 systemd-logind
[59622.628684] [  14513]    81 14513    13516      801   155648      138          -900 dbus-daemon
[59622.630714] [  14694]     0 14694     2072     1104    49152        0             0 3
[59622.632591] [  14696]     0 14696    56226      690    90112      145             0 bash
[59622.634510] [  14734]     0 14734    52899      856   180224      440             0 rsyslogd
[59622.636496] [  14735]     0 14735     1637      343    61440       31             0 agetty
[59622.638468] [  14736]     0 14736     1637      398    61440       32             0 agetty
[59622.640417] [  14737]     0 14737     1637      383    61440       31             0 agetty
[59622.642356] [  14741]     0 14741     5724      572    94208      239             0 crond
[59622.644286] [  14743]     0 14743     1637      398    61440       31             0 agetty
[59622.646226] [  14744]     0 14744     1637      387    57344       31             0 agetty
[59622.648164] [  14746]     0 14746   262198        0   266240     2900             0 syz-execprog
[59622.650200] [  14764]     0 14764    12552        0    36864       16             0 syz-executor
[59622.652255] [  14765]     0 14765    12552        0    36864       16             0 syz-executor
[59622.654283] [  14766]     0 14766    12551        0    49152       44             0 syz-executor
[59622.656306] [  14767]     0 14767    12551        0    49152       41             0 syz-executor
[59622.658318] [  14769]     0 14769    12552        0    36864       16             0 syz-executor
[59622.660328] [  14772]     0 14772    12552        0    36864       17             0 syz-executor
[59622.662330] [  14774]     0 14774    12551        0    49152       30             0 syz-executor
[59622.664337] [  14775]     0 14775    12552        0    36864       17             0 syz-executor
[59622.666456] [  14776]     0 14776    12551        0    49152       34             0 syz-executor
[59622.668449] [  14777]     0 14777    12552        0    36864       16             0 syz-executor
[59622.670459] [  14779]     0 14779    12551        0    49152       37             0 syz-executor
[59622.672445] [  14781]     0 14781    12552        0    36864       16             0 syz-executor
[59622.674449] [  14782]     0 14782    12551        0    49152       35             0 syz-executor
[59622.676423] [  14783]     0 14783    12552        0    36864       17             0 syz-executor
[59622.678418] [  14784]     0 14784    12551        0    49152       36             0 syz-executor
[59622.680430] [  14787]     0 14787    12552        0    36864       16             0 syz-executor
[59622.682398] [  14788]     0 14788    12551        0    49152       24             0 syz-executor
[59622.684359] [  14800]     0 14800    12552        0    36864       16             0 syz-executor
[59622.686357] [  14801]     0 14801    12551        0    49152       23             0 syz-executor
[59622.688449] [  14803]     0 14803    12552        0    36864       16             0 syz-executor
[59622.690434] [  14804]     0 14804    12551        0    49152       38             0 syz-executor
[59622.692466] [  14805]     0 14805    12552        0    36864       16             0 syz-executor
[59622.694436] [  14807]     0 14807    12551        0    49152       25             0 syz-executor
[59622.696378] [  14812]     0 14812    12552        0    36864       17             0 syz-executor
[59622.698313] [  14813]     0 14813    12551        0    49152       33             0 syz-executor
[59622.700252] [  14814]     0 14814    12552        0    36864       16             0 syz-executor
[59622.702186] [  14816]     0 14816    12551        0    49152       27             0 syz-executor
[59622.704126] [  14817]     0 14817    12552        0    36864       17             0 syz-executor
[59622.706053] [  14819]     0 14819    12551        0    49152       32             0 syz-executor
[59622.707978] [  14820]     0 14820    12552        0    36864       16             0 syz-executor
[59622.709923] [  14821]     0 14821    12551        0    49152       31             0 syz-executor
[59622.711847] [  14831]     0 14831    12551        0    49152       42             0 syz-executor
[59622.713773] [  15261]     0 15260    12618        0    65536       31          1000 syz-executor
[59622.715691] [  15263]     0 15262    12618        0    69632        0          1000 syz-executor
[59622.717604] [  15280]     0 15279    12584        0    69632        0          1000 syz-executor
[59622.719506] [  15282]     0 15281    12584        0    69632        0          1000 syz-executor
[59622.721401] [  15284]     0 15283    12584        0    65536        0          1000 syz-executor
[59622.723316] [  15286]     0 15285    12584        0    65536        0          1000 syz-executor
[59622.725243] [  15288]     0 15287    12617        0    65536        0          1000 syz-executor
[59622.727121] [  15291]     0 15289    12584        0    69632        4          1000 syz-executor
[59622.728993] [  15293]     0 15292    12584        0    65536        0          1000 syz-executor
[59622.730857] [  15295]     0 15294    12584        0    65536        0          1000 syz-executor
[59622.732722] [  15297]     0 15296    12584        0    65536        0          1000 syz-executor
[59622.734587] [  15299]     0 15298    12617        0    65536        0          1000 syz-executor
[59622.736441] [  15301]     0 15300    12584        0    69632        0          1000 syz-executor
[59622.738319] [  15304]     0 15303    12584        0    65536        0          1000 syz-executor
[59622.740174] [  15306]     0 15305    12584        0    69632        0          1000 syz-executor
[59622.742022] [  15307]     0 15307    12584        0    69632        7          1000 syz-executor
[59622.743869] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[59622.745603] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.747339] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.748817] Call Trace:
[59622.749557]  dump_stack_lvl+0x57/0x72
[59622.750520]  panic+0xff/0x2ea
[59622.751390]  out_of_memory.cold+0x2f/0x7e
[59622.752413]  pagefault_out_of_memory+0x46/0x60
[59622.753522]  exc_page_fault+0x79/0x2b0
[59622.754504]  ? asm_exc_page_fault+0x8/0x30
[59622.755549]  asm_exc_page_fault+0x1e/0x30
[59622.756579] RIP: 0033:0x41a948
[59622.757441] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.758831] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.760086] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.761687] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.763287] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.764888] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.766512] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18  8:13 ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-18  8:13 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-GEFAQzZX7r8dnm+yROfE0A

[-- Attachment #1: Type: text/plain, Size: 2770 bytes --]

While checking the patches fixed broken memcg accounting in vmalloc I found
another issue: a false global OOM triggered by memcg-limited user space task.

I executed vmalloc-eater inside a memcg limited LXC container in a loop, checked
that it does not consume host memory beyond the assigned limit, triggers memcg OOM
and generates "Memory cgroup out of memory" messages.  Everything was as expected.

However I was surprised to find quite rare global OOM messages too.
I set sysctl vm.panic_on_oom to 1, repeated the test and successfully
crashed the node.

Dmesg showed that global OOM was detected on 16 GB node with ~10 GB of free memory.

 syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
 CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
 Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
 Call Trace:
  dump_stack_lvl+0x57/0x72
  dump_header+0x4a/0x2c1
  out_of_memory.cold+0xa/0x7e
  pagefault_out_of_memory+0x46/0x60
  exc_page_fault+0x79/0x2b0
  asm_exc_page_fault+0x1e/0x30
...
 Mem-Info:
 Node 0 DMA: 0*4kB 0*8kB <...> = 13296kB
 Node 0 DMA32: 705*4kB (UM) <...> = 2586964kB
 Node 0 Normal: 2743*4kB (UME) <...> = 6904828kB
...
 4095866 pages RAM
...
 Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

Full dmesg can be found in attached file.

How could this happen?

User-space task inside the memcg-limited container generated a page fault,
its handler do_user_addr_fault() called handle_mm_fault which could not
allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
Then do_user_addr_fault() called pagefault_out_of_memory() which executed
out_of_memory() without set of memcg.

Partially this problem depends on one of my recent patches, disabled unlimited
memory allocation for dying tasks. However I think the problem can happen
on non-killed tasks too, for example because of kmem limit.

At present do_user_addr_fault() does not know why the page allocation failed,
i.e. whether it was a global or a memcg OOM. I propose to save this information
in a new flag on task_struct. It can be set on memcg restrictions in
try_charge_memcg() (for the memory controller) and in obj_cgroup_charge_pages()
(for the kmem controller). Then it can be checked in mem_cgroup_oom_synchronize(),
called inside pagefault_out_of_memory(): in case of memcg-related restrictions
it will not trigger a fake global OOM but instead return to user space, which
will retry the fault or kill the process if it got a fatal signal.
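A minimal sketch of the idea (the names here are invented for illustration and
need not match the actual patch):

```c
#include <stdbool.h>

/* Hypothetical model of the proposed per-task flag -- not the patch
 * itself.  The charge path records that the failure was
 * memcg-constrained; the pagefault OOM path consumes the flag instead
 * of declaring a global OOM. */

struct model_task {
    bool memcg_in_oom;            /* the proposed new task_struct flag */
};

/* Charge path: on a memcg/kmem limit failure, mark the task before
 * the error propagates up as VM_FAULT_OOM. */
static int model_try_charge(struct model_task *t, bool over_limit)
{
    if (over_limit) {
        t->memcg_in_oom = true;
        return -1;                /* -ENOMEM in the kernel */
    }
    return 0;
}

/* Called from pagefault_out_of_memory(); returns true when the global
 * out_of_memory() should be skipped and the task should just return
 * to user space to retry the fault (or die on a fatal signal). */
static bool model_oom_synchronize(struct model_task *t)
{
    if (t->memcg_in_oom) {
        t->memcg_in_oom = false;  /* flag is consumed once */
        return true;
    }
    return false;                 /* genuine global OOM handling */
}
```

The flag is cleared on the first check, so a later allocation failure that is
really global still reaches the global OOM killer.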

Thank you,
	Vasily Averin

Vasily Averin (1):
  memcg: prevent false global OOM triggered by memcg-limited task.

 include/linux/sched.h |  1 +
 mm/memcontrol.c       | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

-- 
2.32.0


[-- Attachment #2: dmesg-oom.txt --]
[-- Type: text/plain, Size: 25650 bytes --]

[59622.176098] syz-executor invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=1000
[59622.178633] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.180840] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.182562] Call Trace:
[59622.183525]  dump_stack_lvl+0x57/0x72
[59622.184782]  dump_header+0x4a/0x2c1
[59622.185929]  oom_kill_process.cold+0xb/0x10
[59622.187203]  out_of_memory+0x229/0x5b0
[59622.188399]  mem_cgroup_out_of_memory+0x111/0x130
[59622.189773]  try_charge_memcg+0x693/0x720
[59622.191013]  ? kvm_sched_clock_read+0x14/0x30
[59622.192318]  charge_memcg+0x57/0x170
[59622.193482]  mem_cgroup_swapin_charge_page+0x99/0x1d0
[59622.194932]  do_swap_page+0x916/0xbf0
[59622.196110]  ? __lock_acquire+0x3b3/0x1e00
[59622.197377]  __handle_mm_fault+0xa5f/0x14e0
[59622.198649]  ? lock_acquire+0xc4/0x2e0
[59622.199847]  handle_mm_fault+0x149/0x3f0
[59622.201071]  do_user_addr_fault+0x1f4/0x6c0
[59622.202346]  exc_page_fault+0x79/0x2b0
[59622.203545]  ? asm_exc_page_fault+0x8/0x30
[59622.204802]  asm_exc_page_fault+0x1e/0x30
[59622.206043] RIP: 0033:0x41a948
[59622.207115] Code: 64 83 0c 25 08 03 00 00 10 64 48 8b 3c 25 00 03 00 00 e8 8b fe ff ff f4 66 2e 0f 1f 84 00 00 00 00 00 f7 c7 02 00 00 00 75 27 <64> 8b 04 25 08 03 00 00 41 89 c3 41 83 e3 fd f0 64 44 0f b1 1c 25
[59622.211557] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.213058] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.214919] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.216772] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.218623] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.220467] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.222367] memory: usage 524268kB, limit 524288kB, failcnt 28608
[59622.224112] memory+swap: usage 554776kB, limit 9007199254740988kB, failcnt 0
[59622.225982] kmem: usage 524168kB, limit 9007199254740988kB, failcnt 0
[59622.227740] Memory cgroup stats for /lxc.payload.test:
[59622.234613] anon 0
file 122880
kernel_stack 1294336
pagetables 3665920
percpu 490896
sock 0
shmem 0
file_mapped 12288
file_dirty 0
file_writeback 0
swapcached 49008640
anon_thp 0
file_thp 0
shmem_thp 0
inactive_anon 0
active_anon 0
inactive_file 110592
active_file 0
unevictable 0
slab_reclaimable 1513744
slab_unreclaimable 14117008
slab 15630752
workingset_refault_anon 3800
workingset_refault_file 24974
workingset_activate_anon 923
workingset_activate_file 414
workingset_restore_anon 422
workingset_restore_file 190
workingset_nodereclaim 74
pgfault 120040
pgmajfault 4766
pgrefill 15999
pgscan 486263
pgsteal 44646
pgactivate 13066
pgdeactivate 14277
pglazyfree 60
pglazyfreed 34
thp_fault_alloc 0
thp_collapse_alloc 0
[59622.265206] Tasks state (memory values in pages):
[59622.266454] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[59622.268394] [  14696]     0 14696    56226      690    90112      145             0 bash
[59622.270246] [  14746]     0 14746   262198        0   266240     2900             0 syz-execprog
[59622.272217] [  14764]     0 14764    12552        0    36864       16             0 syz-executor
[59622.274217] [  14765]     0 14765    12552        0    36864       16             0 syz-executor
[59622.276165] [  14769]     0 14769    12552        0    36864       16             0 syz-executor
[59622.278091] [  14772]     0 14772    12552        0    36864       17             0 syz-executor
[59622.280023] [  14775]     0 14775    12552        0    36864       17             0 syz-executor
[59622.281949] [  14777]     0 14777    12552        0    36864       16             0 syz-executor
[59622.283847] [  14781]     0 14781    12552        0    36864       16             0 syz-executor
[59622.285737] [  14783]     0 14783    12552        0    36864       17             0 syz-executor
[59622.287673] [  14787]     0 14787    12552        0    36864       16             0 syz-executor
[59622.287968] systemd-journald[648]: Compressed data object 720 -> 393 using ZSTD
[59622.289564] [  14800]     0 14800    12552        0    36864       16             0 syz-executor
[59622.293144] [  14803]     0 14803    12552        0    36864       16             0 syz-executor
[59622.295054] [  14805]     0 14805    12552        0    36864       16             0 syz-executor
[59622.296928] [  14812]     0 14812    12552        0    36864       17             0 syz-executor
[59622.298796] [  14814]     0 14814    12552        0    36864       16             0 syz-executor
[59622.300663] [  14817]     0 14817    12552        0    36864       17             0 syz-executor
[59622.302528] [  14820]     0 14820    12552        0    36864       16             0 syz-executor
[59622.304399] [  14779]     0 14779    12551        0    49152       37             0 syz-executor
[59622.306271] [  15280]     0 15279    12584        0    69632        0          1000 syz-executor
[59622.308144] [  14767]     0 14767    12551        0    49152       41             0 syz-executor
[59622.310046] [  15282]     0 15281    12584        0    69632        0          1000 syz-executor
[59622.311926] [  14766]     0 14766    12551        0    49152       44             0 syz-executor
[59622.313812] [  15286]     0 15285    12584        0    65536        0          1000 syz-executor
[59622.315704] [  14782]     0 14782    12551        0    49152       35             0 syz-executor
[59622.317621] [  15284]     0 15283    12584        0    65536        0          1000 syz-executor
[59622.319570] [  14784]     0 14784    12551        0    49152       36             0 syz-executor
[59622.321496] [  15288]     0 15287    12617        0    65536        0          1000 syz-executor
[59622.323454] [  14776]     0 14776    12551        0    49152       34             0 syz-executor
[59622.325374] [  15291]     0 15289    12584        0    69632        4          1000 syz-executor
[59622.327293] [  14813]     0 14813    12551        0    49152       33             0 syz-executor
[59622.329216] [  15295]     0 15294    12584        0    65536        0          1000 syz-executor
[59622.331139] [  14819]     0 14819    12551        0    49152       32             0 syz-executor
[59622.333068] [  15293]     0 15292    12584        0    65536        0          1000 syz-executor
[59622.334991] [  14821]     0 14821    12551        0    49152       31             0 syz-executor
[59622.336921] [  15297]     0 15296    12584        0    65536        0          1000 syz-executor
[59622.338875] [  14831]     0 14831    12551        0    49152       42             0 syz-executor
[59622.340811] [  15299]     0 15298    12617        0    65536        0          1000 syz-executor
[59622.342752] [  14804]     0 14804    12551        0    49152       38             0 syz-executor
[59622.344700] [  15301]     0 15300    12584        0    69632        0          1000 syz-executor
[59622.346649] [  14816]     0 14816    12551        0    49152       27             0 syz-executor
[59622.348604] [  15306]     0 15305    12584        0    69632        0          1000 syz-executor
[59622.350615] [  14774]     0 14774    12551        0    49152       30             0 syz-executor
[59622.352606] [  15304]     0 15303    12584        0    65536        0          1000 syz-executor
[59622.354575] [  14807]     0 14807    12551        0    49152       25             0 syz-executor
[59622.356531] [  15307]     0 15307    12584        0    69632       37          1000 syz-executor
[59622.358481] [  14801]     0 14801    12551        0    49152       23             0 syz-executor
[59622.360418] [  15261]     0 15260    12618        0    65536       31          1000 syz-executor
[59622.362355] [  14788]     0 14788    12551        0    49152       24             0 syz-executor
[59622.364291] [  15263]     0 15262    12618        0    69632        0          1000 syz-executor
[59622.366269] [  14503]     0 14503    25205     1712   241664      235             0 systemd-journal
[59622.368288] [  14736]     0 14736     1637      398    61440       32             0 agetty
[59622.370161] [  14737]     0 14737     1637      383    61440       31             0 agetty
[59622.372037] [  14743]     0 14743     1637      398    61440       31             0 agetty
[59622.373939] [  14744]     0 14744     1637      387    57344       31             0 agetty
[59622.375849] [  14512]     0 14512    22648     1593   221184      233             0 systemd-logind
[59622.377858] [  14513]    81 14513    13516      801   155648      138          -900 dbus-daemon
[59622.379867] [  14734]     0 14734    52899      856   180224      440             0 rsyslogd
[59622.381813] [  14735]     0 14735     1637      343    61440       31             0 agetty
[59622.383701] [  14741]     0 14741     5724      572    94208      239             0 crond
[59622.385568] [  14086]     0 14086    25345     2120   233472      520             0 systemd
[59622.387470] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=lxc.payload.test,mems_allowed=0,oom_memcg=/lxc.payload.test,task_memcg=/lxc.payload.test,task=syz-executor,pid=15307,uid=0
[59622.391402] Memory cgroup out of memory: Killed process 15307 (syz-executor) total-vm:50336kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:68kB oom_score_adj:1000
[59622.395544] oom_reaper: reaped process 15307 (syz-executor), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[59622.411181] syz-executor invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=1000
[59622.414272] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.416247] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.417950] Call Trace:
[59622.418901]  dump_stack_lvl+0x57/0x72
[59622.420067]  dump_header+0x4a/0x2c1
[59622.421200]  out_of_memory.cold+0xa/0x7e
[59622.422415]  pagefault_out_of_memory+0x46/0x60
[59622.423784]  exc_page_fault+0x79/0x2b0
[59622.424966]  ? asm_exc_page_fault+0x8/0x30
[59622.426207]  asm_exc_page_fault+0x1e/0x30
[59622.427426] RIP: 0033:0x41a948
[59622.428485] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.430064] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.431513] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.433299] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.435090] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.436875] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.438691] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff
[59622.440547] Mem-Info:
[59622.441516] active_anon:279 inactive_anon:24996 isolated_anon:0
 active_file:395743 inactive_file:848017 isolated_file:0
 unevictable:0 dirty:159 writeback:21
 slab_reclaimable:40993 slab_unreclaimable:45236
 mapped:66928 shmem:286 pagetables:2477 bounce:0
 kernel_misc_reclaimable:0
 free:2376272 free_pcp:7579 free_cma:0
[59622.451730] Node 0 active_anon:1116kB inactive_anon:99984kB active_file:1582972kB inactive_file:3392068kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:267712kB dirty:832kB writeback:84kB shmem:1144kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:5488kB pagetables:9908kB all_unreclaimable? no
[59622.458826] Node 0 DMA free:13296kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[59622.465131] lowmem_reserve[]: 0 2706 15265 15265 15265
[59622.466746] Node 0 DMA32 free:2586964kB min:11968kB low:14960kB high:17952kB reserved_highatomic:0KB active_anon:0kB inactive_anon:20kB active_file:27084kB inactive_file:172344kB unevictable:0kB writepending:0kB present:3129200kB managed:2801520kB mlocked:0kB bounce:0kB free_pcp:6120kB local_pcp:1004kB free_cma:0kB
[59622.473718] lowmem_reserve[]: 0 0 12559 12559 12559
[59622.475311] Node 0 Normal free:6904828kB min:55544kB low:69428kB high:83312kB reserved_highatomic:0KB active_anon:1116kB inactive_anon:99964kB active_file:1555888kB inactive_file:3219724kB unevictable:0kB writepending:916kB present:13238272kB managed:12871060kB mlocked:0kB bounce:0kB free_pcp:24068kB local_pcp:692kB free_cma:0kB
[59622.482647] lowmem_reserve[]: 0 0 0 0 0
[59622.484096] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 1*512kB (U) 0*1024kB 2*2048kB (UM) 2*4096kB (M) = 13296kB
[59622.488044] Node 0 DMA32: 705*4kB (UM) 624*8kB (UME) 497*16kB (UME) 594*32kB (UME) 532*64kB (UME) 49*128kB (UME) 36*256kB (UE) 36*512kB (UE) 20*1024kB (U) 17*2048kB (UE) 593*4096kB (U) = 2586964kB
[59622.492687] Node 0 Normal: 2743*4kB (UME) 2084*8kB (UME) 536*16kB (UME) 1496*32kB (UME) 1148*64kB (ME) 171*128kB (UME) 115*256kB (M) 94*512kB (M) 98*1024kB (UME) 163*2048kB (UME) 1517*4096kB (UM) = 6904828kB
[59622.497553] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[59622.499898] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[59622.502197] 1244068 total pagecache pages
[59622.503688] 4 pages in swap cache
[59622.505046] Swap cache stats: add 54921, delete 54922, find 137/9740
[59622.506956] Free swap  = 8357116kB
[59622.508352] Total swap = 8388604kB
[59622.510564] 4095866 pages RAM
[59622.512379] 0 pages HighMem/MovableOnly
[59622.513853] 173881 pages reserved
[59622.515220] 0 pages cma reserved
[59622.516581] 0 pages hwpoisoned
[59622.517897] Tasks state (memory values in pages):
[59622.519524] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[59622.521895] [    648]     0   648    71491    45126   552960        0          -250 systemd-journal
[59622.524550] [    666]     0   666    11976     3110   102400        0         -1000 systemd-udevd
[59622.526944] [    797]   193   797    10732     5087   122880        0             0 systemd-resolve
[59622.529352] [    798]     0   798    26617      605    69632        0         -1000 auditd
[59622.531635] [    800]     0   800     2077     1056    49152        0             0 sedispatch
[59622.533963] [    821]     0   821    19874      761    57344        0             0 irqbalance
[59622.536297] [    823]     0   823     2921      533    45056        0             0 mcelog
[59622.538590] [    824]   996   824   666787     5873   217088        0             0 polkitd
[59622.540882] [    825]     0   825    20273      951    53248        0             0 qemu-ga
[59622.543158] [    826]     0   826   205022    10950  1179648        0             0 rsyslogd
[59622.545449] [    827]     0   827    12186     2965   102400        0             0 sssd
[59622.547678] [    828]     0   828     4502     2175    73728        0             0 systemd-homed
[59622.550047] [    829]     0   829     4466     2118    69632        0             0 systemd-machine
[59622.552466] [    830]    81   830     2530     1083    69632        0          -900 dbus-broker-lau
[59622.554852] [    835]     0   835    68167     3994   139264        0             0 abrtd
[59622.557060] [    839]   982   839    23747      967    69632        0             0 chronyd
[59622.559289] [    840]    81   840     1525      919    49152        0          -900 dbus-broker
[59622.561576] [    843]     0   843    12718     3366   110592        0             0 sssd_be
[59622.563803] [    847]     0   847   177006     5860   655360        0             0 abrt-dump-journ
[59622.566207] [    849]     0   849   174956     6978   827392        0             0 abrt-dump-journ
[59622.568518] [    850]     0   850    15570     9663   143360        0             0 sssd_nss
[59622.570712] [    857]     0   857     4966     2724    81920        0             0 systemd-logind
[59622.572980] [    860]     0   860    79033     2793   118784        0             0 ModemManager
[59622.575247] [    861]     0   861    34022    10210   155648        0             0 firewalld
[59622.577428] [    867]     0   867    66866     5087   147456        0             0 NetworkManager
[59622.579690] [    881]     0   881     7636     2072    73728        0         -1000 sshd
[59622.581790] [    887]     0   887    13477      801    77824        0             0 gssproxy
[59622.583916] [    941]     0   941     5246      715    61440        0             0 atd
[59622.585957] [    943]     0   943     4474      892    65536        0             0 crond
[59622.588007] [    961]     0   961     2408      455    49152        0             0 agetty
[59622.590051] [    962]     0   962     3050      458    57344        0             0 agetty
[59622.592069] [   1018]   991  1018     6645      533    65536        0             0 dnsmasq
[59622.594114] [   1060]     0  1060    11399     2828    90112        0             0 sshd
[59622.596075] [   1459]     0  1459     4387     1924    73728        0             0 systemd-userdbd
[59622.598197] [   1488]  1000  1488     7862     3570    90112        0             0 systemd
[59622.600204] [   1490]  1000  1490    12058     1458   102400        0             0 (sd-pam)
[59622.602218] [   1555]  1000  1555    11399     1367    81920        0             0 sshd
[59622.604170] [   1561]  1000  1561     5188     1443    61440        0             0 bash
[59622.606168] [   2078]   991  2078     6645      532    61440        0             0 dnsmasq
[59622.608154] [   2080]     0  2080     6638      100    61440        0             0 dnsmasq
[59622.610161] [   6747]  1000  6747    10102     2101    94208        0             0 su
[59622.612074] [   6751]     0  6751     4504     1424    65536        0             0 bash
[59622.614022] [  14077]     0 14077     4470     1927    73728        0             0 systemd-userwor
[59622.616135] [  14078]     0 14078     4470     1908    77824        0             0 systemd-userwor
[59622.618252] [  14080]     0 14080     4470     1929    69632        0             0 systemd-userwor
[59622.620358] [  14085]     0 14085     2069      642    57344        0             0 lxc-start
[59622.622379] [  14086]     0 14086    25345     2120   233472      520             0 systemd
[59622.624401] [  14503]     0 14503    25205     1712   241664      235             0 systemd-journal
[59622.626610] [  14512]     0 14512    22648     1593   221184      233             0 systemd-logind
[59622.628684] [  14513]    81 14513    13516      801   155648      138          -900 dbus-daemon
[59622.630714] [  14694]     0 14694     2072     1104    49152        0             0 3
[59622.632591] [  14696]     0 14696    56226      690    90112      145             0 bash
[59622.634510] [  14734]     0 14734    52899      856   180224      440             0 rsyslogd
[59622.636496] [  14735]     0 14735     1637      343    61440       31             0 agetty
[59622.638468] [  14736]     0 14736     1637      398    61440       32             0 agetty
[59622.640417] [  14737]     0 14737     1637      383    61440       31             0 agetty
[59622.642356] [  14741]     0 14741     5724      572    94208      239             0 crond
[59622.644286] [  14743]     0 14743     1637      398    61440       31             0 agetty
[59622.646226] [  14744]     0 14744     1637      387    57344       31             0 agetty
[59622.648164] [  14746]     0 14746   262198        0   266240     2900             0 syz-execprog
[59622.650200] [  14764]     0 14764    12552        0    36864       16             0 syz-executor
[59622.652255] [  14765]     0 14765    12552        0    36864       16             0 syz-executor
[59622.654283] [  14766]     0 14766    12551        0    49152       44             0 syz-executor
[59622.656306] [  14767]     0 14767    12551        0    49152       41             0 syz-executor
[59622.658318] [  14769]     0 14769    12552        0    36864       16             0 syz-executor
[59622.660328] [  14772]     0 14772    12552        0    36864       17             0 syz-executor
[59622.662330] [  14774]     0 14774    12551        0    49152       30             0 syz-executor
[59622.664337] [  14775]     0 14775    12552        0    36864       17             0 syz-executor
[59622.666456] [  14776]     0 14776    12551        0    49152       34             0 syz-executor
[59622.668449] [  14777]     0 14777    12552        0    36864       16             0 syz-executor
[59622.670459] [  14779]     0 14779    12551        0    49152       37             0 syz-executor
[59622.672445] [  14781]     0 14781    12552        0    36864       16             0 syz-executor
[59622.674449] [  14782]     0 14782    12551        0    49152       35             0 syz-executor
[59622.676423] [  14783]     0 14783    12552        0    36864       17             0 syz-executor
[59622.678418] [  14784]     0 14784    12551        0    49152       36             0 syz-executor
[59622.680430] [  14787]     0 14787    12552        0    36864       16             0 syz-executor
[59622.682398] [  14788]     0 14788    12551        0    49152       24             0 syz-executor
[59622.684359] [  14800]     0 14800    12552        0    36864       16             0 syz-executor
[59622.686357] [  14801]     0 14801    12551        0    49152       23             0 syz-executor
[59622.688449] [  14803]     0 14803    12552        0    36864       16             0 syz-executor
[59622.690434] [  14804]     0 14804    12551        0    49152       38             0 syz-executor
[59622.692466] [  14805]     0 14805    12552        0    36864       16             0 syz-executor
[59622.694436] [  14807]     0 14807    12551        0    49152       25             0 syz-executor
[59622.696378] [  14812]     0 14812    12552        0    36864       17             0 syz-executor
[59622.698313] [  14813]     0 14813    12551        0    49152       33             0 syz-executor
[59622.700252] [  14814]     0 14814    12552        0    36864       16             0 syz-executor
[59622.702186] [  14816]     0 14816    12551        0    49152       27             0 syz-executor
[59622.704126] [  14817]     0 14817    12552        0    36864       17             0 syz-executor
[59622.706053] [  14819]     0 14819    12551        0    49152       32             0 syz-executor
[59622.707978] [  14820]     0 14820    12552        0    36864       16             0 syz-executor
[59622.709923] [  14821]     0 14821    12551        0    49152       31             0 syz-executor
[59622.711847] [  14831]     0 14831    12551        0    49152       42             0 syz-executor
[59622.713773] [  15261]     0 15260    12618        0    65536       31          1000 syz-executor
[59622.715691] [  15263]     0 15262    12618        0    69632        0          1000 syz-executor
[59622.717604] [  15280]     0 15279    12584        0    69632        0          1000 syz-executor
[59622.719506] [  15282]     0 15281    12584        0    69632        0          1000 syz-executor
[59622.721401] [  15284]     0 15283    12584        0    65536        0          1000 syz-executor
[59622.723316] [  15286]     0 15285    12584        0    65536        0          1000 syz-executor
[59622.725243] [  15288]     0 15287    12617        0    65536        0          1000 syz-executor
[59622.727121] [  15291]     0 15289    12584        0    69632        4          1000 syz-executor
[59622.728993] [  15293]     0 15292    12584        0    65536        0          1000 syz-executor
[59622.730857] [  15295]     0 15294    12584        0    65536        0          1000 syz-executor
[59622.732722] [  15297]     0 15296    12584        0    65536        0          1000 syz-executor
[59622.734587] [  15299]     0 15298    12617        0    65536        0          1000 syz-executor
[59622.736441] [  15301]     0 15300    12584        0    69632        0          1000 syz-executor
[59622.738319] [  15304]     0 15303    12584        0    65536        0          1000 syz-executor
[59622.740174] [  15306]     0 15305    12584        0    69632        0          1000 syz-executor
[59622.742022] [  15307]     0 15307    12584        0    69632        7          1000 syz-executor
[59622.743869] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[59622.745603] CPU: 2 PID: 15307 Comm: syz-executor Kdump: loaded Not tainted 5.15.0-rc4+ #55
[59622.747339] Hardware name: Virtuozzo KVM, BIOS 1.11.0-2.vz7.4 04/01/2014
[59622.748817] Call Trace:
[59622.749557]  dump_stack_lvl+0x57/0x72
[59622.750520]  panic+0xff/0x2ea
[59622.751390]  out_of_memory.cold+0x2f/0x7e
[59622.752413]  pagefault_out_of_memory+0x46/0x60
[59622.753522]  exc_page_fault+0x79/0x2b0
[59622.754504]  ? asm_exc_page_fault+0x8/0x30
[59622.755549]  asm_exc_page_fault+0x1e/0x30
[59622.756579] RIP: 0033:0x41a948
[59622.757441] Code: Unable to access opcode bytes at RIP 0x41a91e.
[59622.758831] RSP: 002b:00007ffd10ebf6e8 EFLAGS: 00010246
[59622.760086] RAX: 0000000000000000 RBX: 000000000119cf40 RCX: 000000000041b331
[59622.761687] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[59622.763287] RBP: 000000000119d940 R08: 0000000000000000 R09: 0000000000000000
[59622.764888] R10: 00007ffd10ebf7c0 R11: 0000000000000293 R12: 00000000038dc0e1
[59622.766512] R13: 00000000038dbdc3 R14: 20c49ba5e353f7cf R15: ffffffffffffffff

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18  9:04   ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-18  9:04 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Mon 18-10-21 11:13:52, Vasily Averin wrote:
[...]
> How could this happen?
> 
> A user-space task inside the memcg-limited container generated a page fault.
> Its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
> allocate the page because the memcg limit was exceeded and returned VM_FAULT_OOM.
> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
> out_of_memory() without any memcg set.
> 
> This problem partially depends on one of my recent patches, which disabled
> unlimited memory allocation for dying tasks. However, I think the problem can
> happen for non-killed tasks too, for example because of the kmem limit.

Could you be more specific about how this can happen without your patch? I
have to say I hadn't realized this side effect when discussing it.

I will be honest: I am not really happy about pagefault_out_of_memory.
I tried to remove it in the past, without much success back then,
unfortunately [1].
Maybe we should finally get rid of it. The OOM is always triggered from
inside the allocator, where we have much more information about the
allocation context. A first step would be to skip pagefault_out_of_memory
for killed or exiting processes.

[1] I do not have msg-id so I cannot provide a lore link but google
pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
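Such a first step could look roughly like this (a standalone model with
invented field names, not a real patch):

```c
#include <stdbool.h>

/* Model of the suggested early bail-out: a task that already has a
 * fatal signal pending, or is exiting, will die on its way back to
 * user space, so there is no point in running the global OOM killer
 * on its behalf.  Field names are illustrative, not the kernel's. */

struct model_task_state {
    bool fatal_signal_pending;
    bool exiting;
};

/* Returns true when the global out_of_memory() would still run. */
static bool model_pagefault_out_of_memory(const struct model_task_state *t)
{
    if (t->fatal_signal_pending || t->exiting)
        return false;   /* skip: the task is already dying */
    return true;        /* fall through to out_of_memory() */
}
```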
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 10:05     ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-18 10:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> A user-space task inside the memcg-limited container generated a page fault.
>> Its handler, do_user_addr_fault(), called handle_mm_fault(), which could not
>> allocate the page because the memcg limit was exceeded and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>> out_of_memory() without any memcg set.
>>
>> This problem partially depends on one of my recent patches, which disabled
>> unlimited memory allocation for dying tasks. However, I think the problem can
>> happen for non-killed tasks too, for example because of the kmem limit.
> 
> Could you be more specific about how this can happen without your patch? I
> have to say I hadn't realized this side effect when discussing it.

We can reach obj_cgroup_charge_pages(), for example, via:

do_user_addr_fault
handle_mm_fault
__handle_mm_fault
p4d_alloc
__p4d_alloc
p4d_alloc_one
get_zeroed_page
__get_free_pages
alloc_pages
__alloc_pages
__memcg_kmem_charge_page
obj_cgroup_charge_pages

Here we call try_charge_memcg(), which returns success and approves the
allocation; however, we then hit the kmem limit and the allocation fails.
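A toy model of this two-stage charge (illustrative counters, not the kernel
implementation):

```c
#include <stdbool.h>

/* Model of the sequence described above: the "memory" counter accepts
 * the charge, but the separate "kmem" counter is over its limit, so
 * the allocation fails after try_charge_memcg() already succeeded. */

struct model_counter {
    long usage;
    long limit;
};

static bool model_counter_try_charge(struct model_counter *c, long pages)
{
    if (c->usage + pages > c->limit)
        return false;
    c->usage += pages;
    return true;
}

/* Returns 0 on success, -1 (-ENOMEM in the kernel) on failure. */
static int model_obj_cgroup_charge_pages(struct model_counter *memory,
                                         struct model_counter *kmem,
                                         long pages)
{
    if (!model_counter_try_charge(memory, pages))   /* try_charge_memcg() */
        return -1;
    if (!model_counter_try_charge(kmem, pages)) {   /* kmem limit check */
        memory->usage -= pages;                     /* uncharge memory */
        return -1;  /* propagates up and becomes VM_FAULT_OOM */
    }
    return 0;
}
```

The failure comes from a non-dying task and carries no memcg context up the
call chain, which is exactly how the fault handler ends up in
pagefault_out_of_memory().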

If required, I can try to find out how try_charge_memcg() can reject a page
allocation of a non-dying task too.

> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1]. 
> Maybe we should get rid of it finally. The OOM is always triggered from
> inside the allocator where we have much more information about the
> allocation context. A first step would be to skip pagefault_out_of_memory
> for killed or exiting processes.

I like this idea; however, it may not be enough, at least in the scenario described above.

> [1] I do not have msg-id so I cannot provide a lore link but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html

Thank you, I'll read this discussion.
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 10:12       ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-18 10:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 18.10.2021 13:05, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> Partially this problem depends on one of my recent patches, disabled unlimited
>>> memory allocation for dying tasks. However I think the problem can happen
>>> on non-killed tasks too, for example because of kmem limit.
>>
>> Could you be more specific on how this can happen without your patch? I
>> have to say I haven't realized this side effect when discussing it.
> 
> We can reach obj_cgroup_charge_pages() for example via
> 
> do_user_addr_fault
> handle_mm_fault
> __handle_mm_fault
> p4d_alloc
> __p4d_alloc
> p4d_alloc_one
> get_zeroed_page
> __get_free_pages
> alloc_pages
> __alloc_pages
> __memcg_kmem_charge_page
> obj_cgroup_charge_pages
> 
> Here we call try_charge_memcg() that return success and approve the allocation,
> however then we hit into kmem limit and fail the allocation.

Btw, in OpenVZ kernels we try to clean up the memory when a task hits the kmem
limit; therefore we moved the kmem limit check into try_charge_memcg().

Would any improvements to the kmem controller be interesting for upstream?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 11:53       ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-18 11:53 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> User-space task inside the memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
> >> out_of_memory() without set of memcg.
> >>
> >> Partially this problem depends on one of my recent patches, disabled unlimited
> >> memory allocation for dying tasks. However I think the problem can happen
> >> on non-killed tasks too, for example because of kmem limit.
> > 
> > Could you be more specific on how this can happen without your patch? I
> > have to say I haven't realized this side effect when discussing it.
> 
> We can reach obj_cgroup_charge_pages() for example via
> 
> do_user_addr_fault
> handle_mm_fault
> __handle_mm_fault
> p4d_alloc
> __p4d_alloc
> p4d_alloc_one
> get_zeroed_page
> __get_free_pages
> alloc_pages
> __alloc_pages
> __memcg_kmem_charge_page
> obj_cgroup_charge_pages
> 
> Here we call try_charge_memcg() that return success and approve the allocation,
> however then we hit into kmem limit and fail the allocation.

Just to make sure I understand this would be for the v1 kmem explicit
limit, correct?

> If required I can try to search how try_charge_memcg() can reject page allocation 
> of non-dying task too.

Yes.

> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1]. 
> > Maybe we should get rid of it finally. The OOM is always triggered from
> > inside the allocator where we have much more infromation about the
> > allocation context. A first step would be to skip pagefault_out_of_memory
> > for killed or exiting processes.
> 
> I like this idea, however it may be not enough, at least in scenario described above.

My original patch removed the oom killer completely.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 12:27           ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-18 12:27 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

[restore the cc list]

On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> On 18.10.2021 14:53, Michal Hocko wrote:
> > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >> On 18.10.2021 12:04, Michal Hocko wrote:
> >> Here we call try_charge_memcg() that return success and approve the allocation,
> >> however then we hit into kmem limit and fail the allocation.
> > 
> > Just to make sure I understand this would be for the v1 kmem explicit
> > limit, correct?
> 
> yes, I mean this limit.

OK, thanks for the clarification. This is a known problem; have a look at
0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have been
reporting the deprecated and to-be-removed status since 2019 without any
actual report prompted by the kernel message. Maybe we should try to remove
it and see whether that prompts some pushback.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 15:07             ` Shakeel Butt
  0 siblings, 0 replies; 131+ messages in thread
From: Shakeel Butt @ 2021-10-18 15:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
	Cgroups, Linux MM, LKML, kernel

On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>
> [restore the cc list]
>
> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > On 18.10.2021 14:53, Michal Hocko wrote:
> > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > >> Here we call try_charge_memcg() that return success and approve the allocation,
> > >> however then we hit into kmem limit and fail the allocation.
> > >
> > > Just to make sure I understand this would be for the v1 kmem explicit
> > > limit, correct?
> >
> > yes, I mean this limit.
>
> OK, thanks for the clarification. This is a known problem. Have a look
> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> status since 2019 without any actual report sugested by the kernel
> message. Maybe we should try and remove it and see whether that prompts
> some pushback.
>

Yes, I think now should be the right time to take the next step for
deprecation of kmem limits:
https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 16:51               ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-18 16:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
	Cgroups, Linux MM, LKML, kernel

On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > [restore the cc list]
> >
> > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > >> Here we call try_charge_memcg() that return success and approve the allocation,
> > > >> however then we hit into kmem limit and fail the allocation.
> > > >
> > > > Just to make sure I understand this would be for the v1 kmem explicit
> > > > limit, correct?
> > >
> > > yes, I mean this limit.
> >
> > OK, thanks for the clarification. This is a known problem. Have a look
> > at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> > kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> > status since 2019 without any actual report sugested by the kernel
> > message. Maybe we should try and remove it and see whether that prompts
> > some pushback.
> >
> 
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/

I completely forgot about your patch.  Anyway, it usually takes us years
to deprecate something so let's stick with it and consider 2 years as
years ;)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
  2021-10-18 16:51               ` Michal Hocko
  (?)
@ 2021-10-18 17:13               ` Shakeel Butt
  -1 siblings, 0 replies; 131+ messages in thread
From: Shakeel Butt @ 2021-10-18 17:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
	Cgroups, Linux MM, LKML, kernel

On Mon, Oct 18, 2021 at 9:51 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 18-10-21 08:07:20, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > [restore the cc list]
> > >
> > > On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> > > > On 18.10.2021 14:53, Michal Hocko wrote:
> > > > > On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> > > > >> On 18.10.2021 12:04, Michal Hocko wrote:
> > > > >> Here we call try_charge_memcg() that return success and approve the allocation,
> > > > >> however then we hit into kmem limit and fail the allocation.
> > > > >
> > > > > Just to make sure I understand this would be for the v1 kmem explicit
> > > > > limit, correct?
> > > >
> > > > yes, I mean this limit.
> > >
> > > OK, thanks for the clarification. This is a known problem. Have a look
> > > at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
> > > kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
> > > status since 2019 without any actual report sugested by the kernel
> > > message. Maybe we should try and remove it and see whether that prompts
> > > some pushback.
> > >
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> I completely forgot about your patch.  Anyway, it usually takes us years
> to deprecate something so let's stick with it and consider 2 years as
> years ;)
>

Sure, I will rebase and resend.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 18:52               ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-18 18:52 UTC (permalink / raw)
  To: Shakeel Butt, Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM,
	LKML, kernel

On 18.10.2021 18:07, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> [restore the cc list]
>>
>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>> Here we call try_charge_memcg() that return success and approve the allocation,
>>>>> however then we hit into kmem limit and fail the allocation.
>>>>
>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>> limit, correct?
>>>
>>> yes, I mean this limit.
>>
>> OK, thanks for the clarification. This is a known problem. Have a look
>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
>> status since 2019 without any actual report sugested by the kernel
>> message. Maybe we should try and remove it and see whether that prompts
>> some pushback.
>>
> 
> Yes, I think now should be the right time to take the next step for
> deprecation of kmem limits:
> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/

Are you going to push it to stable kernels too?

Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-18 19:18                 ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-18 19:18 UTC (permalink / raw)
  To: Shakeel Butt, Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Mel Gorman, Cgroups, Linux MM,
	LKML, kernel

On 18.10.2021 21:52, Vasily Averin wrote:
> On 18.10.2021 18:07, Shakeel Butt wrote:
>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>
>>> [restore the cc list]
>>>
>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>> Here we call try_charge_memcg() that return success and approve the allocation,
>>>>>> however then we hit into kmem limit and fail the allocation.
>>>>>
>>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>>> limit, correct?
>>>>
>>>> yes, I mean this limit.
>>>
>>> OK, thanks for the clarification. This is a known problem. Have a look
>>> at I think we consider that one to 0158115f702b ("memcg, kmem: deprecate
>>> kmem.limit_in_bytes"). We are reporting the deprecated and to-be removed
>>> status since 2019 without any actual report sugested by the kernel
>>> message. Maybe we should try and remove it and see whether that prompts
>>> some pushback.
>>>
>>
>> Yes, I think now should be the right time to take the next step for
>> deprecation of kmem limits:
>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> 
> Are you going to push it to stable kernels too?

Btw, CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in the Ubuntu 20.04 LTS kernel 5.11.0-37.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19  5:33                 ` Shakeel Butt
  0 siblings, 0 replies; 131+ messages in thread
From: Shakeel Butt @ 2021-10-19  5:33 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
	Cgroups, Linux MM, LKML, kernel

On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 18:07, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> [restore the cc list]
> >>
> >> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>> allocation; however, we then hit the kmem limit and the allocation fails.
> >>>>
> >>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>> limit, correct?
> >>>
> >>> yes, I mean this limit.
> >>
> >> OK, thanks for the clarification. This is a known problem; have a look
> >> at 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have
> >> been reporting the deprecated and to-be-removed status since 2019 without
> >> any actual report prompted by the kernel message. Maybe we should try to
> >> remove it and see whether that prompts some pushback.
> >>
> >
> > Yes, I think now should be the right time to take the next step for
> > deprecation of kmem limits:
> > https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>
> Are you going to push it to stable kernels too?
>

Not really. Is there a reason I should? More exposure to catch breakage?

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19  5:34                   ` Shakeel Butt
  0 siblings, 0 replies; 131+ messages in thread
From: Shakeel Butt @ 2021-10-19  5:34 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
	Cgroups, Linux MM, LKML, kernel

On Mon, Oct 18, 2021 at 12:19 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> On 18.10.2021 21:52, Vasily Averin wrote:
> > On 18.10.2021 18:07, Shakeel Butt wrote:
> >> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>
> >>> [restore the cc list]
> >>>
> >>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>>> allocation; however, we then hit the kmem limit and the allocation fails.
> >>>>>
> >>>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>>> limit, correct?
> >>>>
> >>>> yes, I mean this limit.
> >>>
> >>> OK, thanks for the clarification. This is a known problem; have a look
> >>> at 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have
> >>> been reporting the deprecated and to-be-removed status since 2019 without
> >>> any actual report prompted by the kernel message. Maybe we should try to
> >>> remove it and see whether that prompts some pushback.
> >>>
> >>
> >> Yes, I think now should be the right time to take the next step for
> >> deprecation of kmem limits:
> >> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >
> > Are you going to push it to stable kernels too?
>
> Btw CONFIG_MEMCG_KMEM=y is set both in RHEL8 kernels and in ubuntu 20.04 LTS kernel 5.11.0-37.
>

CONFIG_MEMCG_KMEM is orthogonal to setting kmem limits. We are not
disabling the kmem accounting.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19  6:30         ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-19  6:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 18.10.2021 14:53, Michal Hocko wrote:
> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> A user-space task inside the memcg-limited container generated a page fault;
>>>> its handler do_user_addr_fault() called handle_mm_fault(), which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory(), which executed
>>>> out_of_memory() without any memcg set.
>>>>
>>>> In part this problem depends on one of my recent patches, which disabled
>>>> unlimited memory allocation for dying tasks. However, I think the problem
>>>> can happen for non-killed tasks too, for example because of the kmem limit.
>>>
>>> Could you be more specific on how this can happen without your patch? I
>>> have to say I haven't realized this side effect when discussing it.

>> If required I can try to search how try_charge_memcg() can reject page allocation 
>> of non-dying task too.
> 
> Yes.

Now I think such a failure was very unlikely (without my patch and the kmem limit).
I cannot exclude it completely, because I have not finished this review and perhaps missed something,
but I checked most of the code and found nothing.

With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
a) due to a fatal signal
b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() finds nothing)

To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

However, all these cases can be successfully handled by my new patch
"memcg: prevent false global OOM triggered by memcg limited task",
which I think is a better solution.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19  6:42                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-19  6:42 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
	Cgroups, Linux MM, LKML, kernel

On 19.10.2021 08:33, Shakeel Butt wrote:
> On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
>>
>> On 18.10.2021 18:07, Shakeel Butt wrote:
>>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>
>>>> [restore the cc list]
>>>>
>>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
>>>>> On 18.10.2021 14:53, Michal Hocko wrote:
>>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
>>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
>>>>>>> Here we call try_charge_memcg(), which returns success and approves the
>>>>>>> allocation; however, we then hit the kmem limit and the allocation fails.
>>>>>>
>>>>>> Just to make sure I understand this would be for the v1 kmem explicit
>>>>>> limit, correct?
>>>>>
>>>>> yes, I mean this limit.
>>>>
>>>> OK, thanks for the clarification. This is a known problem; have a look
>>>> at 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have
>>>> been reporting the deprecated and to-be-removed status since 2019 without
>>>> any actual report prompted by the kernel message. Maybe we should try to
>>>> remove it and see whether that prompts some pushback.
>>>
>>> Yes, I think now should be the right time to take the next step for
>>> deprecation of kmem limits:
>>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
>>
>> Are you going to push it to stable kernels too?
> 
> Not really. Is there a reason I should? More exposure to catch breakage?

There is a problem: the kmem limit can trigger a fake global OOM.
To fix it upstream you can remove the kmem controller.

However, how do we handle this problem in stable and LTS kernels?
My current patch resolves the problem too, and it can be backported.
However, I am afraid nobody will do it if the patch is not approved upstream.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19  8:47                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-19  8:47 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Shakeel Butt, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Mel Gorman,
	Cgroups, Linux MM, LKML, kernel

On Tue 19-10-21 09:42:38, Vasily Averin wrote:
> On 19.10.2021 08:33, Shakeel Butt wrote:
> > On Mon, Oct 18, 2021 at 11:52 AM Vasily Averin <vvs@virtuozzo.com> wrote:
> >>
> >> On 18.10.2021 18:07, Shakeel Butt wrote:
> >>> On Mon, Oct 18, 2021 at 5:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>>
> >>>> [restore the cc list]
> >>>>
> >>>> On Mon 18-10-21 15:14:26, Vasily Averin wrote:
> >>>>> On 18.10.2021 14:53, Michal Hocko wrote:
> >>>>>> On Mon 18-10-21 13:05:35, Vasily Averin wrote:
> >>>>>>> On 18.10.2021 12:04, Michal Hocko wrote:
> >>>>>>> Here we call try_charge_memcg(), which returns success and approves the
> >>>>>>> allocation; however, we then hit the kmem limit and the allocation fails.
> >>>>>>
> >>>>>> Just to make sure I understand this would be for the v1 kmem explicit
> >>>>>> limit, correct?
> >>>>>
> >>>>> yes, I mean this limit.
> >>>>
> >>>> OK, thanks for the clarification. This is a known problem; have a look
> >>>> at 0158115f702b ("memcg, kmem: deprecate kmem.limit_in_bytes"). We have
> >>>> been reporting the deprecated and to-be-removed status since 2019 without
> >>>> any actual report prompted by the kernel message. Maybe we should try to
> >>>> remove it and see whether that prompts some pushback.
> >>>
> >>> Yes, I think now should be the right time to take the next step for
> >>> deprecation of kmem limits:
> >>> https://lore.kernel.org/all/20201118175726.2453120-1-shakeelb@google.com/
> >>
> >> Are you going to push it to stable kernels too?
> > 
> > Not really. Is there a reason I should? More exposure to catch breakage?
> 
> There is a problem: the kmem limit can trigger a fake global OOM.
> To fix it upstream you can remove the kmem controller.
> 
> However, how do we handle this problem in stable and LTS kernels?

I do not see any bug reports coming in, and I strongly suspect this is
because the functionality is simply not used widely enough, or not in the
mode where it would matter (U != 0, K < U from our documentation).
If there are relevant use cases for this setup then we really want to
hear about them, because we do not want to break userspace. Handling
that setup would be far from trivial, and the global oom killer is not
the only problem of that kind.

> My current patch resolves the problem too, and it can be backported.
> However, I am afraid nobody will do it if the patch is not approved upstream.

I do not think your current approach is the right one. We do not really
want yet another flag to tell us we are in a memcg OOM. We already have
one.

A proper way is to deal with the pagefault oom killer trigger but that
might be just too risky for stable kernels. 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19  8:49           ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-19  8:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 19-10-21 09:30:18, Vasily Averin wrote:
[...]
> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> a) due to fatal signal
> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> 
> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

How is b) possible without current being killed? Do we allow remote
charging?

> However all these cases can be successfully handled by my new patch
> "memcg: prevent false global OOM triggered by memcg limited task"
> and I think it is better solution.

I have already replied to your approach in another email. Sorry, our
replies have crossed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 10:30             ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-19 10:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 19.10.2021 11:49, Michal Hocko wrote:
> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> [...]
>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>> a) due to fatal signal
>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>
>> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
>> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.

> How is b) possible without current being killed? Do we allow remote
> charging?

out_of_memory for memcg_oom
 select_bad_process
  mem_cgroup_scan_tasks
   oom_evaluate_task
    oom_badness

        /*
         * Do not even consider tasks which are explicitly marked oom
         * unkillable, have already been oom reaped, or are in
         * the middle of vfork
         */
        adj = (long)p->signal->oom_score_adj;
        if (adj == OOM_SCORE_ADJ_MIN ||
                        test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
                        in_vfork(p)) {
                task_unlock(p);
                return LONG_MIN;
        }

This time we are handling a userspace page fault, so we cannot be a kernel
thread and cannot be in_vfork().
However, the task can be marked oom-unkillable,
i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN.

This can be set via oom_score_adj_write().

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 11:54               ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-19 11:54 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> On 19.10.2021 11:49, Michal Hocko wrote:
> > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > [...]
> >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> >> a) due to fatal signal
> >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> >>
> >> To handle a) we can follow your suggestion and skip execution of out_of_memory() in pagefault_out_of_memory().
> >> To handle b) we can go to retry: if mem_cgroup_oom() returns OOM_FAILED.
> 
> > How is b) possible without current being killed? Do we allow remote
> > charging?
> 
> out_of_memory for memcg_oom
>  select_bad_process
>   mem_cgroup_scan_tasks
>    oom_evaluate_task
>     oom_badness
> 
>         /*
>          * Do not even consider tasks which are explicitly marked oom
>          * unkillable or have been already oom reaped or the are in
>          * the middle of vfork
>          */
>         adj = (long)p->signal->oom_score_adj;
>         if (adj == OOM_SCORE_ADJ_MIN ||
>                         test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>                         in_vfork(p)) {
>                 task_unlock(p);
>                 return LONG_MIN;
>         }
> 
> This time we handle userspace page fault, so we cannot be kenrel thread,
> and cannot be in_vfork().
> However task can be marked as oom unkillable, 
> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

You are right. I am not sure there is a way out of this though. The task
can only retry forever in this case. There is nothing actionable here.
We cannot kill the task and there is no other way to release the memory.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 12:04                 ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-19 12:04 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> > On 19.10.2021 11:49, Michal Hocko wrote:
> > > On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> > > [...]
> > >> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> > >> a) due to fatal signal
> > >> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> > >>
> > >> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory()
> > >> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.
> > 
> > > How is b) possible without current being killed? Do we allow remote
> > > charging?
> > 
> > out_of_memory for memcg_oom
> >  select_bad_process
> >   mem_cgroup_scan_tasks
> >    oom_evaluate_task
> >     oom_badness
> > 
> >         /*
> >          * Do not even consider tasks which are explicitly marked oom
> >          * unkillable or have been already oom reaped or the are in
> >          * the middle of vfork
> >          */
> >         adj = (long)p->signal->oom_score_adj;
> >         if (adj == OOM_SCORE_ADJ_MIN ||
> >                         test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> >                         in_vfork(p)) {
> >                 task_unlock(p);
> >                 return LONG_MIN;
> >         }
> > 
> > This time we handle userspace page fault, so we cannot be kenrel thread,
> > and cannot be in_vfork().
> > However task can be marked as oom unkillable, 
> > i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> 
> You are right. I am not sure there is a way out of this though. The task
> can only retry for ever in this case. There is nothing actionable here.
> We cannot kill the task and there is no other way to release the memory.

Btw. don't we force the charge in that case?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 13:26                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-19 13:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 19.10.2021 15:04, Michal Hocko wrote:
> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>> [...]
>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>>>>> a) due to fatal signal
>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>>>>
>>>>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory()
>>>>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.
>>>
>>>> How is b) possible without current being killed? Do we allow remote
>>>> charging?
>>>
>>> out_of_memory for memcg_oom
>>>  select_bad_process
>>>   mem_cgroup_scan_tasks
>>>    oom_evaluate_task
>>>     oom_badness
>>>
>>>         /*
>>>          * Do not even consider tasks which are explicitly marked oom
>>>          * unkillable or have been already oom reaped or the are in
>>>          * the middle of vfork
>>>          */
>>>         adj = (long)p->signal->oom_score_adj;
>>>         if (adj == OOM_SCORE_ADJ_MIN ||
>>>                         test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>>                         in_vfork(p)) {
>>>                 task_unlock(p);
>>>                 return LONG_MIN;
>>>         }
>>>
>>> This time we handle userspace page fault, so we cannot be kenrel thread,
>>> and cannot be in_vfork().
>>> However task can be marked as oom unkillable, 
>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>>
>> You are right. I am not sure there is a way out of this though. The task
>> can only retry for ever in this case. There is nothing actionable here.
>> We cannot kill the task and there is no other way to release the memory.
> 
> Btw. don't we force the charge in that case?

We should force the charge for allocations made from inside the page fault
handler, to prevent an endless cycle of retried page faults.
However, we should not do it for allocations from task context,
to prevent memcg-limited vmalloc-eaters from consuming all host memory.

Also I would like to return to the following hunk.
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);

Now I think it's better to keep the task_is_dying() check here.
If the task is dying, there is no need to push another task to free memory.
We have already broken the vmalloc cycle, so it looks like nothing should
prevent us from returning to userspace, handling the fatal signal, exiting
and freeing the memory.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 14:13                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-19 14:13 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 19-10-21 16:26:50, Vasily Averin wrote:
> On 19.10.2021 15:04, Michal Hocko wrote:
> > On Tue 19-10-21 13:54:42, Michal Hocko wrote:
> >> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
> >>> On 19.10.2021 11:49, Michal Hocko wrote:
> >>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
> >>>> [...]
> >>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
> >>>>> a) due to fatal signal
> >>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
> >>>>>
> >>>>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory()
> >>>>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.
> >>>
> >>>> How is b) possible without current being killed? Do we allow remote
> >>>> charging?
> >>>
> >>> out_of_memory for memcg_oom
> >>>  select_bad_process
> >>>   mem_cgroup_scan_tasks
> >>>    oom_evaluate_task
> >>>     oom_badness
> >>>
> >>>         /*
> >>>          * Do not even consider tasks which are explicitly marked oom
> >>>          * unkillable or have been already oom reaped or the are in
> >>>          * the middle of vfork
> >>>          */
> >>>         adj = (long)p->signal->oom_score_adj;
> >>>         if (adj == OOM_SCORE_ADJ_MIN ||
> >>>                         test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
> >>>                         in_vfork(p)) {
> >>>                 task_unlock(p);
> >>>                 return LONG_MIN;
> >>>         }
> >>>
> >>> This time we handle userspace page fault, so we cannot be kenrel thread,
> >>> and cannot be in_vfork().
> >>> However task can be marked as oom unkillable, 
> >>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> >>
> >> You are right. I am not sure there is a way out of this though. The task
> >> can only retry for ever in this case. There is nothing actionable here.
> >> We cannot kill the task and there is no other way to release the memory.
> > 
> > Btw. don't we force the charge in that case?
> 
> We should force charge for allocation from inside page fault handler,
> to prevent endless cycle in retried page faults.
> However we should not do it for allocations from task context,
> to prevent memcg-limited vmalloc-eaters from to consume all host memory.

I don't see a big difference between those two, because the #PF could
result in the very same situation, depleting all the memory by
overcharging. A different behavior just leads to confusion and
unexpected results. E.g. in the past we only triggered the memcg OOM
killer from the #PF path and failed the charge otherwise. That was
something different, but it showed problems we hadn't anticipated and
caused user-visible issues. See 29ef680ae7c2 ("memcg, oom: move
out_of_memory back to the charge path").

> Also I would like to return to the following hunk.
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);
>  
>  unlock:
>  	mutex_unlock(&oom_lock);
> 
> Now I think it's better to keep task_is_dying() check here.
> if task is dying, it is not necessary to push other task to free the memory.
> We broke vmalloc cycle already, so it looks like nothing should prevent us
> from returning to userspace, handle fatal signal, exit and free the memory.

That patch has to be discussed in its full length. There were other
details I brought up, AFAIU.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 14:19                       ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-19 14:19 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Tue 19-10-21 16:13:19, Michal Hocko wrote:
[...]
> That patch has to be discuss in its full length. There were other
> details I have brought up AFAIU.

And btw. thanks for walking through that maze! It is far from trivial,
with side effects all over the place which are far from obvious and
heavily underdocumented.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 19:09                       ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-19 19:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 19.10.2021 17:13, Michal Hocko wrote:
> On Tue 19-10-21 16:26:50, Vasily Averin wrote:
>> On 19.10.2021 15:04, Michal Hocko wrote:
>>> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>>>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>>>> [...]
>>>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>>>>>>> a) due to fatal signal
>>>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>>>>>>
>>>>>>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory()
>>>>>>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.
>>>>>
>>>>>> How is b) possible without current being killed? Do we allow remote
>>>>>> charging?
>>>>>
>>>>> out_of_memory for memcg_oom
>>>>>  select_bad_process
>>>>>   mem_cgroup_scan_tasks
>>>>>    oom_evaluate_task
>>>>>     oom_badness
>>>>>
>>>>>         /*
>>>>>          * Do not even consider tasks which are explicitly marked oom
>>>>>          * unkillable or have been already oom reaped or the are in
>>>>>          * the middle of vfork
>>>>>          */
>>>>>         adj = (long)p->signal->oom_score_adj;
>>>>>         if (adj == OOM_SCORE_ADJ_MIN ||
>>>>>                         test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>>>>                         in_vfork(p)) {
>>>>>                 task_unlock(p);
>>>>>                 return LONG_MIN;
>>>>>         }
>>>>>
>>>>> This time we handle userspace page fault, so we cannot be kenrel thread,
>>>>> and cannot be in_vfork().
>>>>> However task can be marked as oom unkillable, 
>>>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
>>>>
>>>> You are right. I am not sure there is a way out of this though. The task
>>>> can only retry for ever in this case. There is nothing actionable here.
>>>> We cannot kill the task and there is no other way to release the memory.
>>>
>>> Btw. don't we force the charge in that case?
>>
>> We should force charge for allocation from inside page fault handler,
>> to prevent endless cycle in retried page faults.
>> However we should not do it for allocations from task context,
>> to prevent memcg-limited vmalloc-eaters from to consume all host memory.
> 
> I don't see a big difference between those two. Because the #PF could
> result into the very same situation depleting all the memory by
> overcharging. A different behavior just leads to a confusion and
> unexpected behavior. E.g. in the past we only triggered memcg OOM killer
> from the #PF path and failed the charge otherwise. That is something
> different but it shows problems we haven't anticipated and had user
> visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back
> to the charge path").

In this case I think we should fail the allocation.
It's better not to allow overcharging, neither in #PF nor in regular allocations.

However, this failure will trigger a false global OOM in pagefault_out_of_memory(),
and we need to find some way to prevent that.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-19 19:09                       ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-19 19:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-GEFAQzZX7r8dnm+yROfE0A

On 19.10.2021 17:13, Michal Hocko wrote:
> On Tue 19-10-21 16:26:50, Vasily Averin wrote:
>> On 19.10.2021 15:04, Michal Hocko wrote:
>>> On Tue 19-10-21 13:54:42, Michal Hocko wrote:
>>>> On Tue 19-10-21 13:30:06, Vasily Averin wrote:
>>>>> On 19.10.2021 11:49, Michal Hocko wrote:
>>>>>> On Tue 19-10-21 09:30:18, Vasily Averin wrote:
>>>>>> [...]
>>>>>>> With my patch ("memcg: prohibit unconditional exceeding the limit of dying tasks") try_charge_memcg() can fail:
>>>>>>> a) due to fatal signal
>>>>>>> b) when mem_cgroup_oom -> mem_cgroup_out_of_memory -> out_of_memory() returns false (when select_bad_process() found nothing)
>>>>>>>
>>>>>>> To handle a) we can follow to your suggestion and skip excution of out_of_memory() in pagefault_out_of memory()
>>>>>>> To handle b) we can go to retry: if mem_cgroup_oom() return OOM_FAILED.
>>>>>
>>>>>> How is b) possible without current being killed? Do we allow remote
>>>>>> charging?
>>>>>
>>>>> out_of_memory for memcg_oom
>>>>>  select_bad_process
>>>>>   mem_cgroup_scan_tasks
>>>>>    oom_evaluate_task
>>>>>     oom_badness
>>>>>
>>>>>         /*
>>>>>          * Do not even consider tasks which are explicitly marked oom
>>>>>          * unkillable or have been already oom reaped or the are in
>>>>>          * the middle of vfork
>>>>>          */
>>>>>         adj = (long)p->signal->oom_score_adj;
>>>>>         if (adj == OOM_SCORE_ADJ_MIN ||
>>>>>                         test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
>>>>>                         in_vfork(p)) {
>>>>>                 task_unlock(p);
>>>>>                 return LONG_MIN;
>>>>>         }
>>>>>
>>>>> This time we handle a userspace page fault, so we cannot be a kernel
>>>>> thread, and cannot be in_vfork().
>>>>> However, the task can be marked as oom unkillable,
>>>>> i.e. have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
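
For illustration, "oom unkillable" above simply means oom_score_adj == OOM_SCORE_ADJ_MIN (-1000), which a task can inspect through procfs. This is a minimal userspace sketch (read_oom_score_adj() is a helper name invented here, not a kernel API):

```c
#include <limits.h>
#include <stdio.h>

/*
 * Read the calling task's /proc/self/oom_score_adj. A value of -1000
 * (OOM_SCORE_ADJ_MIN) makes oom_badness() return LONG_MIN for the task,
 * so it can never be selected as an OOM victim.
 */
static int read_oom_score_adj(void)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "r");
	int adj = INT_MIN;	/* sentinel on error */

	if (f) {
		if (fscanf(f, "%d", &adj) != 1)
			adj = INT_MIN;
		fclose(f);
	}
	return adj;
}
```

On Linux the value is always within [-1000, 1000]; lowering it below the current value requires CAP_SYS_RESOURCE.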
>>>>
>>>> You are right. I am not sure there is a way out of this though. The task
>>>> can only retry for ever in this case. There is nothing actionable here.
>>>> We cannot kill the task and there is no other way to release the memory.
>>>
>>> Btw. don't we force the charge in that case?
>>
>> We should force the charge for allocations from inside the page fault
>> handler, to prevent an endless cycle of retried page faults.
>> However we should not do it for allocations from task context,
>> to prevent memcg-limited vmalloc-eaters from consuming all host memory.
> 
> I don't see a big difference between those two. Because the #PF could
> result into the very same situation depleting all the memory by
> overcharging. A different behavior just leads to a confusion and
> unexpected behavior. E.g. in the past we only triggered memcg OOM killer
> from the #PF path and failed the charge otherwise. That is something
> different but it shows problems we haven't anticipated and had user
> visible problems. See 29ef680ae7c2 ("memcg, oom: move out_of_memory back
> to the charge path").

In this case I think we should fail this allocation.
It's better not to allow overcharge, neither in #PF nor in regular allocations.

However, this failure will trigger a false global OOM in pagefault_out_of_memory(),
and we need to find some way to prevent it.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-20  8:07                         ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20  8:07 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of memory charged by those
tasks is bounded and that most of it will get released while the task
is exiting. This resembles the heuristic for the global OOM situation,
where dying tasks get access to memory reserves. There is no real
memory shortage at the memcg level, so the memcg heuristic is more relaxed.

The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time.

It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation-wise, a killed or dying
task fails to charge if it has already passed the OOM killer stage. That
should give all forms of reclaim a chance to restore the limit before
the failure (ENOMEM) tells the caller to back off.

In addition, this patch renames the should_force_charge() helper
to task_is_dying() because its use is no longer associated with forced
charging.

If try_charge_memcg() is called from #PF, its new failures can force
pagefault_out_of_memory() to execute the global OOM. To prevent this,
pagefault_out_of_memory() was updated to properly handle memcg-related
restrictions.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v4: updated pagefault_out_of_memory() to properly handle new memcg-related
     restrictions and not allow a false global OOM
v3: no functional changes, just improved patch description
v2: switched to patch version proposed by mhocko@

 mm/memcontrol.c | 52 ++++++++++++++++++++++++++++---------------------
 mm/oom_kill.c   |  3 +++
 2 files changed, 33 insertions(+), 22 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..b09d3c64f63f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		mem_cgroup_oom_notify(memcg);
 
 	mem_cgroup_unmark_under_oom(memcg);
-	if (mem_cgroup_out_of_memory(memcg, mask, order))
+	if (mem_cgroup_out_of_memory(memcg, mask, order)) {
 		ret = OOM_SUCCESS;
-	else
+	} else {
 		ret = OOM_FAILED;
-
+		/*
+		 * In some rare cases mem_cgroup_out_of_memory() can return false.
+		 * If it was called from #PF, this forces handle_mm_fault() to
+		 * return VM_FAULT_OOM and to run pagefault_out_of_memory().
+		 * memcg_in_oom is set here to notify pagefault_out_of_memory()
+		 * that this was a memcg-related failure, so that it does not
+		 * run the global OOM killer.
+		 */
+		if (current->in_user_fault)
+			current->memcg_in_oom = (struct mem_cgroup *)ret;
+	}
 	if (locked)
 		mem_cgroup_oom_unlock(memcg);
 
@@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
+	/* The OOM was in a memcg, however out_of_memory() found no victim */
+	if (memcg == (struct mem_cgroup *)OOM_FAILED) {
+		/*
+		 * Should be called from pagefault_out_of_memory() only,
+		 * where it is used to prevent false global OOM.
+		 */
+		current->memcg_in_oom = NULL;
+		return true;
+	}
 	if (!handle)
 		goto cleanup;
 
@@ -2530,6 +2549,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2564,15 +2584,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2641,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2652,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (fatal_signal_pending(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.32.0
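
For illustration, the reworked retry policy in try_charge_memcg() can be modeled in plain userspace C. All names in this sketch (task_model, oom_kills_left, charge_fits) are invented; only the control flow follows the patch: once a dying task has passed the memcg OOM killer stage, the charge fails with -ENOMEM instead of being forced over the limit.

```c
#include <errno.h>
#include <stdbool.h>

struct task_model {
	bool dying;          /* tsk_is_oom_victim() || fatal signal || PF_EXITING */
	int  oom_kills_left; /* times the memcg OOM killer can still succeed */
	bool charge_fits;    /* the charge now fits under the memcg limit */
};

/* returns 0 on success, -ENOMEM when the charge must fail */
static int try_charge_model(struct task_model *t)
{
	bool passed_oom = false;

	for (;;) {
		if (t->charge_fits)
			return 0;                 /* charge succeeded */
		/* Avoid an endless loop for tasks bypassed by the OOM killer */
		if (passed_oom && t->dying)
			return -ENOMEM;
		if (t->oom_kills_left-- > 0) {    /* OOM_SUCCESS: retry */
			passed_oom = true;
			continue;
		}
		return -ENOMEM;                   /* OOM_FAILED */
	}
}
```

Note how a dying task now loops at most once through the OOM killer stage before getting -ENOMEM, whereas the old code would goto force and exceed the limit.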


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-20  8:43                           ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-20  8:43 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Wed 20-10-21 11:07:02, Vasily Averin wrote:
[...]
I haven't read through the changelog and only focused on the patch this
time.

[...]
> @@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
>  		mem_cgroup_oom_notify(memcg);
>  
>  	mem_cgroup_unmark_under_oom(memcg);
> -	if (mem_cgroup_out_of_memory(memcg, mask, order))
> +	if (mem_cgroup_out_of_memory(memcg, mask, order)) {
>  		ret = OOM_SUCCESS;
> -	else
> +	} else {
>  		ret = OOM_FAILED;
> -
> +		/*
> +		 * In some rare cases mem_cgroup_out_of_memory() can return false.
> +		 * If it was called from #PF it forces handle_mm_fault()
> +		 * return VM_FAULT_OOM and executes pagefault_out_of_memory().
> +		 * memcg_in_oom is set here to notify pagefault_out_of_memory()
> +		 * that it was a memcg-related failure and not allow to run
> +		 * global OOM.
> +		 */
> +		if (current->in_user_fault)
> +			current->memcg_in_oom = (struct mem_cgroup *)ret;
> +	}
>  	if (locked)
>  		mem_cgroup_oom_unlock(memcg);
>  
> @@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!memcg)
>  		return false;
>  
> +	/* OOM is memcg, however out_of_memory() found no victim */
> +	if (memcg == (struct mem_cgroup *)OOM_FAILED) {
> +		/*
> +		 * Should be called from pagefault_out_of_memory() only,
> +		 * where it is used to prevent false global OOM.
> +		 */
> +		current->memcg_in_oom = NULL;
> +		return true;
> +	}
>  	if (!handle)
>  		goto cleanup;

I have to say I am not a great fan of this but this belongs to a
separate patch on top of all the previous ones.

[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..1deef8c7a71b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> +	if (fatal_signal_pending(current))
> +		return;
> +

This belongs to its own patch as well.

All that being said, I would go with pagefault_out_of_memory as the first
patch because it is necessary to handle charge failures. Then go with a
patch to remove charge forcing when the OOM killer succeeds but the retry
still fails, and finally go with one that tries to handle oom failures.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg RFC 0/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-20 12:11                             ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20 12:11 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

Dear Michal,
as you requested, I split the v4 patch into 3 separate parts.
Let's discuss each of them.

Open questions/ToDo:

- Update patch descriptions (and add some comments) 

- in part 2 aka "memcg: remove charge forcing for dying tasks":
  should we keep task_is_dying() in mem_cgroup_out_of_memory()?

- in part 3 aka "memcg: handle memcg oom failures":
   what is the best way to notify pagefault_out_of_memory() about
    a mem_cgroup_out_of_memory() failure?

- what is the best way to handle a memcg failure due to the kmem limit,
    which can trigger a false global OOM

Vasily Averin (3):
  mm: do not force global OOM from inside dying tasks
  memcg: remove charge forcing for dying tasks
  memcg: handle memcg oom failures

 mm/memcontrol.c | 52 ++++++++++++++++++++++++++++---------------------
 mm/oom_kill.c   |  3 +++
 2 files changed, 33 insertions(+), 22 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg 1/3] mm: do not force global OOM from inside dying tasks
@ 2021-10-20 12:12                               ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20 12:12 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

There is no sense in forcing a global OOM if the current task is dying.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (fatal_signal_pending(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.32.0
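
The decision order this patch establishes in pagefault_out_of_memory() can be sketched as a small userspace model (the enum and function names here are invented for the sketch; the comments map each branch to the kernel code):

```c
#include <stdbool.h>

enum pf_action { PF_MEMCG_HANDLED, PF_TASK_DYING, PF_GLOBAL_OOM };

/*
 * Model of the patched pagefault_out_of_memory(): a pending memcg OOM is
 * handled first, then a dying task simply returns so it can exit, and
 * only otherwise is the global OOM killer invoked.
 */
static enum pf_action pagefault_oom_model(bool memcg_oom_pending,
					  bool fatal_signal_pending)
{
	if (memcg_oom_pending)		/* mem_cgroup_oom_synchronize(true) */
		return PF_MEMCG_HANDLED;
	if (fatal_signal_pending)	/* new check added by this patch */
		return PF_TASK_DYING;
	return PF_GLOBAL_OOM;		/* falls through to out_of_memory(&oc) */
}
```

A dying task thus never reaches the global OOM path, which is the point of this patch.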


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
@ 2021-10-20 12:13                               ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20 12:13 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory()?

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..74a7379dbac1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2642,6 +2635,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
 	switch (oom_status) {
 	case OOM_SUCCESS:
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
 	case OOM_FAILED:
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH memcg 3/3] memcg: handle memcg oom failures
@ 2021-10-20 12:14                               ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20 12:14 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

mem_cgroup_oom() can fail if the current task was marked unkillable
and the oom killer cannot find any victim.

Currently we force the memcg charge for such allocations; however,
this allows a memcg-limited userspace task to overuse its assigned
limits and potentially trigger a global memory shortage.

Let's fail the memory charge in such cases.

This failure should be somehow recognised in #PF context,
so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED

ToDo: what is the best way to notify pagefault_out_of_memory() about 
    mem_cgroup_out_of_memory failure ?

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 74a7379dbac1..b09d3c64f63f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1810,11 +1810,21 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 		mem_cgroup_oom_notify(memcg);
 
 	mem_cgroup_unmark_under_oom(memcg);
-	if (mem_cgroup_out_of_memory(memcg, mask, order))
+	if (mem_cgroup_out_of_memory(memcg, mask, order)) {
 		ret = OOM_SUCCESS;
-	else
+	} else {
 		ret = OOM_FAILED;
-
+		/*
+		 * In some rare cases mem_cgroup_out_of_memory() can return false.
+		 * If it was called from #PF it forces handle_mm_fault() to
+		 * return VM_FAULT_OOM, which executes pagefault_out_of_memory().
+		 * memcg_in_oom is set here to notify pagefault_out_of_memory()
+		 * that this was a memcg-related failure, so it does not run
+		 * the global OOM killer.
+		 */
+		if (current->in_user_fault)
+			current->memcg_in_oom = (struct mem_cgroup *)ret;
+	}
 	if (locked)
 		mem_cgroup_oom_unlock(memcg);
 
@@ -1848,6 +1858,15 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
+	/* The OOM was a memcg OOM, but out_of_memory() found no victim */
+	if (memcg == (struct mem_cgroup *)OOM_FAILED) {
+		/*
+		 * Should be called from pagefault_out_of_memory() only,
+		 * where it is used to prevent false global OOM.
+		 */
+		current->memcg_in_oom = NULL;
+		return true;
+	}
 	if (!handle)
 		goto cleanup;
 
@@ -2633,15 +2652,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
 		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks
@ 2021-10-20 12:33                                 ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-20 12:33 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

s@firce@force@

On Wed 20-10-21 15:12:19, Vasily Averin wrote:
> There is no sense to force global OOM if current task is dying.

This really begs for much more information. Feel free to get an
inspiration from my previous attempt to achieve something similar.
At a minimum it is important to mention that the OOM killer is already
handled at the page allocator level for the global OOM and at the
charging level for the memcg one. Both have much more information
about the scope of allocation/charge request. This means that either the
OOM killer has been invoked properly and didn't lead to the allocation
success or it has been skipped because it couldn't have been invoked.
In both cases triggering it from here is pointless and even harmful.

Another argument is that it is more reasonable to let killed task die
rather than hit the oom killer and retry the allocation.

> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/oom_kill.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..1deef8c7a71b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> +	if (fatal_signal_pending(current))
> +		return;
> +
>  	if (!mutex_trylock(&oom_lock))
>  		return;
>  	out_of_memory(&oc);
> -- 
> 2.32.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
@ 2021-10-20 12:41                                 ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-20 12:41 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Wed 20-10-21 15:13:46, Vasily Averin wrote:
> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  mm/memcontrol.c | 20 +++++++-------------
>  1 file changed, 7 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..74a7379dbac1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>  		(current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);

Why are you keeping the task_is_dying check here? IIRC I have already
pointed out that out_of_memory already has some means to do a bypass
when needed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
  2021-10-20 12:14                               ` Vasily Averin
  (?)
@ 2021-10-20 13:02                               ` Michal Hocko
  2021-10-20 15:46                                   ` Vasily Averin
  -1 siblings, 1 reply; 131+ messages in thread
From: Michal Hocko @ 2021-10-20 13:02 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> mem_cgroup_oom() can fail if the current task was marked unkillable
> and the oom killer cannot find any victim.
> 
> Currently we force the memcg charge for such allocations; however,
> this allows a memcg-limited userspace task to overuse its assigned
> limits and potentially trigger a global memory shortage.

You should really go into more detail about whether this is a practical
problem to handle. OOM_FAILED means that the memcg oom killer couldn't
find any oom victim, so it cannot help with forward progress. There are
not that many situations in which that can happen. Naming them would be
really useful.
 
> Let's fail the memory charge in such cases.
> 
> This failure should be somehow recognised in #PF context,

explain why

> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
> 
> ToDo: what is the best way to notify pagefault_out_of_memory() about 
>     mem_cgroup_out_of_memory failure ?

why don't you simply remove out_of_memory from pagefault_out_of_memory
and leave it only with the blocking memcg OOM handling? Wouldn't that be a
more generic solution? Your first patch already goes that way partially.

This change is riskier than the first one. If somebody returns
VM_FAULT_OOM without invoking the allocator then it can loop forever,
but invoking the OOM killer in such a situation is equally wrong, as the
oom killer cannot really help, right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 1/3] mm: do not firce global OOM from inside dying tasks
@ 2021-10-20 13:52                                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20 13:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On 20.10.2021 15:33, Michal Hocko wrote:
> s@firce@force@
> 
> On Wed 20-10-21 15:12:19, Vasily Averin wrote:
>> There is no sense to force global OOM if current task is dying.
> 
> This really begs for much more information. Feel free to get an
> inspiration from my previous attempt to achieve something similar.
> At a minimum it is important to mention that the OOM killer is already
> handled at the page allocator level for the global OOM and at the
> charging level for the memcg one. Both have much more information
> about the scope of allocation/charge request. This means that either the
> OOM killer has been invoked properly and didn't lead to the allocation
> success or it has been skipped because it couldn't have been invoked.
> In both cases triggering it from here is pointless and even harmful.
> 
> Another argument is that it is more reasonable to let killed task die
> rather than hit the oom killer and retry the allocation.

Thank you,
I'll update the patch description later;
for now I would like to clarify the patch content.

>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/oom_kill.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 831340e7ad8b..1deef8c7a71b 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
>>  	if (mem_cgroup_oom_synchronize(true))
>>  		return;
>>  
>> +	if (fatal_signal_pending(current))
>> +		return;
>> +
>>  	if (!mutex_trylock(&oom_lock))
>>  		return;
>>  	out_of_memory(&oc);
>> -- 
>> 2.32.0
> 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
@ 2021-10-20 14:21                                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20 14:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On 20.10.2021 15:41, Michal Hocko wrote:
> On Wed 20-10-21 15:13:46, Vasily Averin wrote:
>> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
>>
>> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
>> ---
>>  mm/memcontrol.c | 20 +++++++-------------
>>  1 file changed, 7 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6da5020a8656..74a7379dbac1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -239,7 +239,7 @@ enum res_type {
>>  	     iter != NULL;				\
>>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>>  
>> -static inline bool should_force_charge(void)
>> +static inline bool task_is_dying(void)
>>  {
>>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>>  		(current->flags & PF_EXITING);
>> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>  	 * A few threads which were not waiting at mutex_lock_killable() can
>>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>>  	 */
>> -	ret = should_force_charge() || out_of_memory(&oc);
>> +	ret = task_is_dying() || out_of_memory(&oc);
> 
> Why are you keeping the task_is_dying check here? IIRC I have already
> pointed out that out_of_memory already has some means to do a bypass
> when needed.

It was a misunderstanding.
I've been waiting for your final decision.

I have no good arguments "pro" or strong objection "contra". 
However, I prefer to keep task_is_dying() so as not to touch other tasks unnecessarily.

I can't justify why its removal is necessary,
but on the other hand, it shouldn't break anything.

I can drop it if you think it's necessary.

Thank you,
	Vasily Averin


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
@ 2021-10-20 14:57                                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-20 14:57 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Wed 20-10-21 17:21:33, Vasily Averin wrote:
> On 20.10.2021 15:41, Michal Hocko wrote:
> > On Wed 20-10-21 15:13:46, Vasily Averin wrote:
> >> ToDo: should we keep task_is_dying() in mem_cgroup_out_of_memory() ?
> >>
> >> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> >> ---
> >>  mm/memcontrol.c | 20 +++++++-------------
> >>  1 file changed, 7 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6da5020a8656..74a7379dbac1 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -239,7 +239,7 @@ enum res_type {
> >>  	     iter != NULL;				\
> >>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
> >>  
> >> -static inline bool should_force_charge(void)
> >> +static inline bool task_is_dying(void)
> >>  {
> >>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> >>  		(current->flags & PF_EXITING);
> >> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >>  	 * A few threads which were not waiting at mutex_lock_killable() can
> >>  	 * fail to bail out. Therefore, check again after holding oom_lock.
> >>  	 */
> >> -	ret = should_force_charge() || out_of_memory(&oc);
> >> +	ret = task_is_dying() || out_of_memory(&oc);
> > 
> > Why are you keeping the task_is_dying check here? IIRC I have already
> > pointed out that out_of_memory already has some means to do a bypass
> > when needed.
> 
> It was a misunderstanding.

Sorry if I made you confused.

> I've been waiting for your final decision.
> 
> I have no good arguments "pro" or strong objection "contra". 
> However, I prefer to keep task_is_dying() so as not to touch other tasks unnecessarily.

One argument for removing it from here is maintainability. Now you
have a memcg-specific check which is not in sync with the oom code.
E.g. out_of_memory() does task_will_free_mem() as the very first thing.
You are also automatically excluding the oom killer in cases where it
might make sense.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
@ 2021-10-20 15:20                                       ` Tetsuo Handa
  0 siblings, 0 replies; 131+ messages in thread
From: Tetsuo Handa @ 2021-10-20 15:20 UTC (permalink / raw)
  To: Michal Hocko, Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 2021/10/20 23:57, Michal Hocko wrote:
> One argument for removing it from here is maintainability. Now you
> have a memcg-specific check which is not in sync with the oom path. E.g.
> out_of_memory does task_will_free_mem as the very first thing. You are
> also automatically excluding the oom killer for cases where that might
> make sense.

What makes it possible to remove this check?
This check was added because task_will_free_mem() in out_of_memory() does NOT work.
See commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer").
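
The check this thread keeps coming back to (renamed to task_is_dying() in the diff earlier in the thread) can be sketched as a simplified userspace model; the struct layout and flag value below are illustrative stand-ins, not the kernel's actual definitions:

```c
#include <stdbool.h>

/* Simplified stand-ins for kernel state (assumptions, not real kernel
 * definitions): tsk_is_oom_victim(), fatal_signal_pending() and the
 * PF_EXITING flag are modeled as plain fields/bits. */
#define PF_EXITING 0x00000004

struct task {
    bool oom_victim;            /* models tsk_is_oom_victim(current) */
    bool fatal_signal_pending;  /* models fatal_signal_pending(current) */
    unsigned int flags;         /* models current->flags */
};

/* Mirrors the renamed helper: a task counts as dying if it is already
 * an OOM victim, has a fatal signal pending, or is exiting. */
bool task_is_dying(const struct task *t)
{
    return t->oom_victim || t->fatal_signal_pending ||
           (t->flags & PF_EXITING);
}
```

A dying task in this sense is expected to release its memory shortly, which is why the charge path treats it specially.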

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
@ 2021-10-20 15:46                                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-20 15:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On 20.10.2021 16:02, Michal Hocko wrote:
> On Wed 20-10-21 15:14:27, Vasily Averin wrote:
>> mem_cgroup_oom() can fail if current task was marked unkillable
>> and oom killer cannot find any victim.
>>
>> Currently we force memcg charge for such allocations,
>> however it allows a memcg-limited userspace task to overuse its assigned limits
>> and potentially trigger the global memory shortage.
> 
> You should really go into more details whether that is a practical
> problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> find any oom victim so it cannot help with a forward progress. There are
> not that many situations when that can happen. Naming that would be
> really useful.

I've pointed this out above:
"if current task was marked unkillable and oom killer cannot find any victim."
This may happen when the current task cannot be oom-killed because it is marked
unkillable, i.e. it has p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN,
and the other processes in the memcg are either dying, are kernel threads, or are
marked unkillable in the same way. It can also happen when the memcg contains this process only.

If we always approve such allocations, this can be misused.
A process can mmap a lot of memory,
then touch it, generating page faults and overcharged memory allocations.
Finally it can consume all node memory and trigger a global memory shortage on the host.

>> Let's fail the memory charge in such cases.
>>
>> This failure should be somehow recognised in #PF context,
> 
> explain why

When a #PF cannot allocate memory (for the reasons described above), handle_mm_fault returns VM_FAULT_OOM,
and its caller executes pagefault_out_of_memory(). If the latter cannot recognize the real reason for the failure,
it assumes a global memory shortage and executes the global out_of_memory(), which can kill a random process
or even crash the node if sysctl vm.panic_on_oom is set to 1.

Currently pagefault_out_of_memory() knows about a possible async memcg OOM and handles it correctly.
However it is not aware that the memcg can reject some other allocations, so it does not recognize the fault
as memcg-related and allows the global OOM to run.
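
A minimal userspace model of the idea (the names and the sentinel encoding below are assumptions for illustration; the proposal stores (struct mem_cgroup *)OOM_FAILED in current->memcg_in_oom):

```c
#include <stdbool.h>
#include <stddef.h>

/* Opaque stand-in for the kernel's struct mem_cgroup. */
struct mem_cgroup;

/* Sentinel meaning "memcg OOM handling ran and failed"; the encoding
 * is an assumption made for this sketch. */
#define MEMCG_OOM_FAILED ((struct mem_cgroup *)1)

struct task {
    /* NULL: no memcg OOM pending; a memcg pointer: async memcg OOM to
     * handle; MEMCG_OOM_FAILED: the charge was rejected outright. */
    struct mem_cgroup *memcg_in_oom;
};

/* The decision pagefault_out_of_memory() would make: any non-NULL
 * value marks the VM_FAULT_OOM as memcg-related, so the global OOM
 * killer (and vm.panic_on_oom) must not be triggered. */
bool fault_is_memcg_related(const struct task *t)
{
    return t->memcg_in_oom != NULL;
}
```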

>> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
>>
>> ToDo: what is the best way to notify pagefault_out_of_memory() about 
>>     mem_cgroup_out_of_memory failure ?
> 
> why don't you simply remove out_of_memory from pagefault_out_of_memory
> and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> more generic solution? Your first patch already goes that way partially.

I clearly understand that the global out_of_memory should not be triggered by memcg restrictions.
I clearly understand that a dying task will release some memory soon, so we need not run the global OOM if the current task is dying.

However I'm not sure that I can remove out_of_memory entirely. At least I do not have good arguments for doing it.

> This change is more risky than the first one. If somebody returns
> VM_FAULT_OOM without invoking the allocator then it can loop forever, but
> invoking the OOM killer in such a situation is equally wrong as the oom
> killer cannot really help, right?

I'm not ready to comment on this right now; I'll take some time to think about it.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-21  8:03     ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-21  8:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 18.10.2021 12:04, Michal Hocko wrote:
> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> [...]
>> How could this happen?
>>
>> User-space task inside the memcg-limited container generated a page fault,
>> its handler do_user_addr_fault() called handle_mm_fault which could not
>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>> out_of_memory() without a memcg set.

> I will be honest that I am not really happy about pagefault_out_of_memory.
> I have tried to remove it in the past. Without much success back then,
> unfortunately[1]. 
> 
> [1] I do not have msg-id so I cannot provide a lore link but google
> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html

I re-read this discussion and in general I support your position.
As far as I understand, your opponents cannot explain why the "random kill" is mandatory here;
they are just afraid that it might be useful and do not want to remove it completely.

Ok, let's allow them that. Moreover, I'm ready to keep it as the default behavior.

However I would like to have some choice in this point.

In general we can:
- continue to use "random kill" and rely on the wisdom of the ancestors.
- do nothing, repeat the #PF and rely on fate: "nothing bad will happen if we do it again".
- add some (progressive) killable delay, rely on the good will of (unkillable) neighbors and wait for them to release the required memory.
- mark the current task as cycling in #PF and somehow use this mark in the allocator.
- make sure that the current task is really cycling with no progress, then send it a fatal signal to kill it and break the cycle.
- implement any better ideas,
- use any combination of the previous points.

We can select the required strategy, for example via sysctl.

For me the "random kill" is the worst choice.
Why can't we just kill the looping process instead?
It can be marked as oom-unkillable, so the OOM killer is unable to select it.
However, I doubt that means "never kill it"; for me it is something like a "last possible victim" priority.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 2/3] memcg: remove charge forcing for dying tasks
@ 2021-10-21 10:03                                         ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-21 10:03 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Andrew Morton,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, cgroups, linux-mm, linux-kernel, kernel

On Thu 21-10-21 00:20:02, Tetsuo Handa wrote:
> On 2021/10/20 23:57, Michal Hocko wrote:
> > One argument for removing it from here is maintainability. Now you
> > have a memcg-specific check which is not in sync with the oom path. E.g.
> > out_of_memory does task_will_free_mem as the very first thing. You are
> > also automatically excluding the oom killer for cases where that might
> > make sense.
> 
> What makes it possible to remove this check?
> This check was added because task_will_free_mem() in out_of_memory() does NOT work.
> See commit 7775face207922ea ("memcg: killed threads should not invoke memcg OOM killer").

You are right. I had forgotten about this and should have checked git blame.
Thanks for bringing this up!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
@ 2021-10-21 11:49       ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-21 11:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On Thu 21-10-21 11:03:43, Vasily Averin wrote:
> On 18.10.2021 12:04, Michal Hocko wrote:
> > On Mon 18-10-21 11:13:52, Vasily Averin wrote:
> > [...]
> >> How could this happen?
> >>
> >> User-space task inside the memcg-limited container generated a page fault,
> >> its handler do_user_addr_fault() called handle_mm_fault which could not
> >> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
> >> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
> >> out_of_memory() without a memcg set.
> 
> > I will be honest that I am not really happy about pagefault_out_of_memory.
> > I have tried to remove it in the past. Without much success back then,
> > unfortunately[1]. 
> > 
> > [1] I do not have msg-id so I cannot provide a lore link but google
> > pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
> 
> I re-read this discussion and in general I support your position.
> As far as I understand your opponents cannot explain why "random kill" is mandatory here,
> they are just afraid that it might be useful here and do not want to remove it completely.

That aligns with my recollection.

> Ok, let's allow him to do it. Moreover I'm ready to keep it as default behavior.
> 
> However I would like to have some choice in this point.
> 
> In general we can:
> - continue to use "random kill" and rely on the wisdom of the ancestors.

I do not follow. Does that mean preserving the existing oom killer in the
#PF path?

> - do nothing, repeat #PF and rely on fate: "nothing bad will happen if we do it again".
> - add some (progressive) killable delay, rely on good will of (unkillable) neighbors and wait for them to release required memory.

Again, not really sure what you mean

> - mark the current task as cycled in #PF and somehow use this mark in allocator

How?

> - make sure that the current task is really cycling with no progress, then send it a fatal signal to kill it and break the cycle.

No! We cannot really kill the task; if we could, the oom killer would
have done it already.

> - implement any better ideas,
> - use any combination of previous points
> 
> We can select required strategy for example via sysctl.

Absolutely not! How can an admin know any better than the kernel?

> For me the "random kill" is the worst choice.
> Why can't we just kill the looping process instead?

See above.

> It can be marked as oom-unkillable, so the OOM killer is unable to select it.
> However, I doubt that means "never kill it"; for me it is something like a "last possible victim" priority.

It means never kill it because of OOM. If it is retrying because of OOM
then it is effectively the same thing.

The oom killer from the #PF path doesn't really provide any clear advantage
these days AFAIK. On the other hand it allows for a very disruptive
behavior. In the worst case it can lead to a system panic if the
VM_FAULT_OOM is not really caused by a memory shortage but rather by
wrong error handling. If a task is looping there without any progress
then it is still killable, which is a much saner behavior IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
@ 2021-10-21 11:49                                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-21 11:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Wed 20-10-21 18:46:56, Vasily Averin wrote:
> On 20.10.2021 16:02, Michal Hocko wrote:
> > On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> >> mem_cgroup_oom() can fail if current task was marked unkillable
> >> and oom killer cannot find any victim.
> >>
> >> Currently we force memcg charge for such allocations,
> >> however it allows a memcg-limited userspace task to overuse its assigned limits
> >> and potentially trigger the global memory shortage.
> > 
> > You should really go into more details whether that is a practical
> > problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> > find any oom victim so it cannot help with a forward progress. There are
> > not that many situations when that can happen. Naming that would be
> > really useful.
> 
> I've pointed this out above:
> "if current task was marked unkillable and oom killer cannot find any victim."
> This may happen when the current task cannot be oom-killed because it is marked
> unkillable, i.e. it has p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN,
> and the other processes in the memcg are either dying, are kernel threads, or are
> marked unkillable in the same way. It can also happen when the memcg contains this process only.
> 
> If we always approve such allocations, this can be misused.
> A process can mmap a lot of memory,
> then touch it, generating page faults and overcharged memory allocations.
> Finally it can consume all node memory and trigger a global memory shortage on the host.

Yes, this is true, but a) OOM_SCORE_ADJ_MIN tasks are excluded from the
OOM handling so they have to be careful with their memory consumption, and
b) is this a theoretical or a practical concern?

This is mostly what I wanted to make sure you describe in the changelog.

> >> Let's fail the memory charge in such cases.
> >>
> >> This failure should be somehow recognised in #PF context,
> > 
> > explain why
> 
> When a #PF cannot allocate memory (for the reasons described above), handle_mm_fault returns VM_FAULT_OOM,
> and its caller executes pagefault_out_of_memory(). If the latter cannot recognize the real reason for the failure,
> it assumes a global memory shortage and executes the global out_of_memory(), which can kill a random process
> or even crash the node if sysctl vm.panic_on_oom is set to 1.
> 
> Currently pagefault_out_of_memory() knows about a possible async memcg OOM and handles it correctly.
> However it is not aware that the memcg can reject some other allocations, so it does not recognize the fault
> as memcg-related and allows the global OOM to run.

Again something to be added to the changelog.

> >> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
> >>
> >> ToDo: what is the best way to notify pagefault_out_of_memory() about 
> >>     mem_cgroup_out_of_memory failure ?
> > 
> > why don't you simply remove out_of_memory from pagefault_out_of_memory
> > and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> > more generic solution? Your first patch already goes that way partially.
> 
> I clearly understand that the global out_of_memory should not be triggered by memcg restrictions.
> I clearly understand that a dying task will release some memory soon, so we need not run the global OOM if the current task is dying.
> 
> However I'm not sure that I can remove out_of_memory entirely. At least I do not have good arguments for doing it.

I do understand that handling a very specific case sounds easier, but it
would be better to have a robust fix even if that requires some more
head scratching. So far we have collected several reasons why it is
bad to trigger the oom killer from the #PF path. There is no single argument
to keep it, so removing it sounds like a viable path to pursue. Maybe there are
some very well hidden reasons, but those should be documented, and this is
a great opportunity either to remove it or to document them.

Moreover if it turns out that there is a regression then this can be
easily reverted and a different, maybe memcg specific, solution can be
implemented.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
@ 2021-10-21 11:49                                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-21 11:49 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-GEFAQzZX7r8dnm+yROfE0A

On Wed 20-10-21 18:46:56, Vasily Averin wrote:
> On 20.10.2021 16:02, Michal Hocko wrote:
> > On Wed 20-10-21 15:14:27, Vasily Averin wrote:
> >> mem_cgroup_oom() can fail if current task was marked unkillable
> >> and oom killer cannot find any victim.
> >>
> >> Currently we force memcg charge for such allocations,
> >> however it allow memcg-limited userspace task in to overuse assigned limits
> >> and potentially trigger the global memory shortage.
> > 
> > You should really go into more details whether that is a practical
> > problem to handle. OOM_FAILED means that the memcg oom killer couldn't
> > find any oom victim so it cannot help with a forward progress. There are
> > not that many situations when that can happen. Naming that would be
> > really useful.
> 
> I've pointed above: 
> "if current task was marked unkillable and oom killer cannot find any victim."
> This may happen when current task cannot be oom-killed because it was marked
> unkillable i.e. it have p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> and other processes in memcg are either dying, or are kernel threads or are marked unkillable 
> by the same way. Or when memcg have this process only.
> 
> If we always approve such kind of allocation, it can be misused.
> Process can mmap a lot of memory,
> ant then touch it and generate page fault and make overcharged memory allocations.
> Finally it can consume all node memory and trigger global memory shortage on the host.

Yes, this is true but a) OOM_SCORE_ADJ_MIN tasks are excluded from the
OOM handling so they have to be careful with the memory consumption and
b) is this a theoretical or a practical concern. 

This is mostly what I wanted to make sure you describe in the changelog.

> >> Let's fail the memory charge in such cases.
> >>
> >> This failure should be somehow recognised in #PF context,
> > 
> > explain why
> 
> When a #PF cannot allocate memory (for the reason described above), handle_mm_fault returns VM_FAULT_OOM,
> then its caller executes pagefault_out_of_memory(). If the latter cannot recognize the real reason for the failure,
> it assumes a global memory shortage and executes the global out_of_memory(), which can kill a random process 
> or even crash the node if sysctl vm.panic_on_oom is set to 1.
> 
> Currently pagefault_out_of_memory() knows about a possible async memcg OOM and handles it correctly.
> However it is not aware that memcg can reject some other allocations, does not recognize the fault
> as memcg-related, and allows the global OOM to run.

Again something to be added to the changelog.

> >> so let's use current->memcg_in_oom == (struct mem_cgroup *)OOM_FAILED
> >>
> >> ToDo: what is the best way to notify pagefault_out_of_memory() about 
> >>     a mem_cgroup_out_of_memory() failure?
> > 
> > why don't you simply remove out_of_memory from pagefault_out_of_memory
> > and leave it only with the blocking memcg OOM handling? Wouldn't that be a
> > more generic solution? Your first patch already goes that way partially.
> 
> I clearly understand that the global out_of_memory should not be triggered by memcg restrictions.
> I clearly understand that a dying task will release some memory soon and we need not run the global oom if the current task is dying.
> 
> However I'm not sure that I can remove out_of_memory at all. At least I do not have good arguments to do it.

I do understand that handling a very specific case sounds easier but it
would be better to have a robust fix even if that requires some more
head scratching. So far we have collected several reasons why it is
bad to trigger the oom killer from the #PF path. There is no single argument
to keep it, so it sounds like a viable path to pursue. Maybe there are
some very well hidden reasons but those should be documented, and this is
a great opportunity to do either of those steps.

Moreover if it turns out that there is a regression then this can be
easily reverted and a different, maybe memcg specific, solution can be
implemented.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 0/1] false global OOM triggered by memcg-limited task
  2021-10-21 11:49       ` Michal Hocko
@ 2021-10-21 13:24         ` Vasily Averin
  -1 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-21 13:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	cgroups, linux-mm, linux-kernel, kernel

On 21.10.2021 14:49, Michal Hocko wrote:
> On Thu 21-10-21 11:03:43, Vasily Averin wrote:
>> On 18.10.2021 12:04, Michal Hocko wrote:
>>> On Mon 18-10-21 11:13:52, Vasily Averin wrote:
>>> [...]
>>>> How could this happen?
>>>>
>>>> User-space task inside the memcg-limited container generated a page fault,
>>>> its handler do_user_addr_fault() called handle_mm_fault which could not
>>>> allocate the page due to exceeding the memcg limit and returned VM_FAULT_OOM.
>>>> Then do_user_addr_fault() called pagefault_out_of_memory() which executed
>>>> out_of_memory() without any memcg set.
>>
>>> I will be honest that I am not really happy about pagefault_out_of_memory.
>>> I have tried to remove it in the past. Without much success back then,
>>> unfortunately[1]. 
>>>
>>> [1] I do not have msg-id so I cannot provide a lore link but google
>>> pointed me to https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1400402.html
>>
>> I re-read this discussion and in general I support your position.
>> As far as I understand, your opponents cannot explain why "random kill" is mandatory here;
>> they are just afraid that it might be useful and do not want to remove it completely.
> 
> That aligns with my recollection.
> 
>> Ok, let's allow him to do it. Moreover I'm ready to keep it as default behavior.
>>
>> However I would like to have some choice in this point.
>>
>> In general we can:
>> - continue to use "random kill" and rely on the wisdom of the ancestors.
> 
> I do not follow. Does that mean to preserve existing oom killer from
> #PF?
> 
>> - do nothing, repeat the #PF and rely on fate: "nothing bad will happen if we do it again".
>> - add some (progressive) killable delay, rely on the good will of (unkillable) neighbors and wait for them to release the required memory.
> 
> Again, not really sure what you mean
> 
>> - mark the current task as cycling in #PF and somehow use this mark in the allocator
> 
> How?
> 
>> - make sure that the current task is really cycling and making no progress, then send it a fatal signal to kill it and break the cycle.
> 
> No! We cannot really kill the task; if we could, we would have done it
> via the oom killer already
> 
>> - implement any better ideas,
>> - use any combination of previous points
>>
>> We can select required strategy for example via sysctl.
> 
> Absolutely not! How can an admin know better than the kernel?
> 
>> For me "random kill" is the worst choice.
>> Why can't we just kill the looping process instead?
> 
> See above.
> 
>> It can be marked as oom-unkillable, so the OOM-killer is unable to select it.
>> However I doubt it means "never kill it"; for me it is something like "last possible victim" priority.
> 
> It means never kill it because of OOM. If it is retrying because of OOM
> then it is effectively the same thing.
> 
> The oom killer from the #PF doesn't really provide any clear advantage
> these days AFAIK. On the other hand it allows for a very disruptive
> behavior. In the worst case it can lead to a system panic if the
> VM_FAULT_OOM is not really caused by a memory shortage but rather by
> wrong error handling. If a task is looping there without any progress
> then it is still killable, which is a much saner behavior IMHO.

Let's continue this discussion in "Re: [PATCH memcg 3/3] memcg: handle memcg oom failures" thread.
.


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
@ 2021-10-21 15:05                                       ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-21 15:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On 21.10.2021 14:49, Michal Hocko wrote:
> I do understand that handling a very specific case sounds easier but it
> would be better to have a robust fix even if that requires some more
> head scratching. So far we have collected several reasons why it is
> bad to trigger the oom killer from the #PF path. There is no single argument
> to keep it, so it sounds like a viable path to pursue. Maybe there are
> some very well hidden reasons but those should be documented, and this is
> a great opportunity to do either of those steps.
> 
> Moreover if it turns out that there is a regression then this can be
> easily reverted and a different, maybe memcg specific, solution can be
> implemented.

Now I agree,
however I still have a few open questions.

1) VM_FAULT_OOM may be triggered without execution of out_of_memory();
for example it can be caused by incorrect vm fault operations,
(a) which can return this error without calling the allocator at all,
(b) or which can pass incorrect gfp flags so that the allocator fails without executing out_of_memory.
(c) It may also happen on stable/LTS kernels, when an otherwise successful allocation hits the limit of the legacy memcg-kmem controller.
We'll drop that controller in upstream kernels, however how do we handle this in old kernels?

We can make sure that out_of_memory() or the allocator was called by setting some per-task flags.

Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?

If not, the task will loop.
It is much better than executing the global OOM, however it would be even better to avoid it somehow.

You said: "We cannot really kill the task if we could we would have done it by the oom killer already".
However what should we do if we never even tried to use the oom-killer (see (b) and (c)),
or did not use the allocator at all (see (a))?

2) In your patch we just exit from pagefault_out_of_memory() and restart a new #PF.
We could call schedule_timeout() and wait some time before the new #PF restart.
Additionally we could increase this delay in each new cycle.
This helps to save CPU time for other tasks.
What do you think about it?

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg 3/3] memcg: handle memcg oom failures
@ 2021-10-21 16:47                                         ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-21 16:47 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Thu 21-10-21 18:05:28, Vasily Averin wrote:
> On 21.10.2021 14:49, Michal Hocko wrote:
> > I do understand that handling a very specific case sounds easier but it
> > would be better to have a robust fix even if that requires some more
> > head scratching. So far we have collected several reasons why it is
> > bad to trigger the oom killer from the #PF path. There is no single argument
> > to keep it, so it sounds like a viable path to pursue. Maybe there are
> > some very well hidden reasons but those should be documented, and this is
> > a great opportunity to do either of those steps.
> > 
> > Moreover if it turns out that there is a regression then this can be
> > easily reverted and a different, maybe memcg specific, solution can be
> > implemented.
> 
> Now I agree,
> however I still have a few open questions.
> 
> 1) VM_FAULT_OOM may be triggered without execution of out_of_memory();
> for example it can be caused by incorrect vm fault operations,
> (a) which can return this error without calling the allocator at all.

I would argue this to be a bug. How can that particular code tell
whether the system is OOM and the oom killer is a reasonable measure
to take?

> (b) or which can pass incorrect gfp flags so that the allocator fails without executing out_of_memory.

I am not sure I can see any sensible scenario where pagefault oom killer
would be an appropriate fix for that.

> (c) It may also happen on stable/LTS kernels, when an otherwise successful allocation hits the limit of the legacy memcg-kmem controller.
> We'll drop that controller in upstream kernels, however how do we handle this in old kernels?

Triggering the global oom killer for legacy kmem charge failure is
clearly wrong. Removing oom killer from #PF would fix that problem.

> We can make sure that out_of_memory() or the allocator was called by setting some per-task flags.

I am not sure I see how that would be useful other than reporting a
dubious VM_FAULT_OOM usage. I am also not sure how that would be
implemented, as the allocator can be called several times, not to mention
that the allocation itself could have been done from a different
context - e.g. a WQ.

> Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?

In principle it can, as sending a signal is not prohibited. I would argue
it should not, though, because it is just the wrong thing to do in all
those cases.

> If not, the task will loop.

Yes, but it will be killable from userspace. So this is not an
unrecoverable situation.

> It is much better than executing the global OOM, however it would be even better to avoid it somehow.

How?

> You said: "We cannot really kill the task if we could we would have done it by the oom killer already".
> However what should we do if we never even tried to use the oom-killer (see (b) and (c)),
> or did not use the allocator at all (see (a))?

See above

> 2) In your patch we just exit from pagefault_out_of_memory() and restart a new #PF.
> We could call schedule_timeout() and wait some time before the new #PF restart.
> Additionally we could increase this delay in each new cycle.
> This helps to save CPU time for other tasks.
> What do you think about it?

I do not have a strong opinion on this. A short sleep makes sense. I am
not sure a more complex implementation is really needed.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg v2 0/2] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-22  8:10                                           ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-22  8:10 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. This can be misused to trigger a global OOM from inside a
memcg-limited container. On the other hand, if memcg fails an allocation
called from inside the #PF handler, it triggers a global OOM from inside
pagefault_out_of_memory().

To prevent these problems this patch set:
1) removes the execution of out_of_memory() from pagefault_out_of_memory(),
   because nobody can explain why it is necessary;
2) allows memcg to fail the allocations of dying/killed tasks. 

v2: resplit,
    use old patch from Michal Hocko removing out_of_memory() from
      pagefault_out_of_memory()

Michal Hocko (1):
  mm, oom: do not trigger out_of_memory from the #PF

Vasily Averin (1):
  memcg: prohibit unconditional exceeding the limit of dying tasks

 mm/memcontrol.c | 27 ++++++++-------------------
 mm/oom_kill.c   | 23 ++++++++++-------------
 2 files changed, 18 insertions(+), 32 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-22  8:11                                             ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-22  8:11 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

From: Michal Hocko <mhocko@suse.com>

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for
2 different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.

The latter is quite problematic because the allocation paths already
trigger out_of_memory and the page allocator tries really hard to not
fail allocations. Anyway, if the OOM killer has already been invoked
there is no reason to invoke it again from the #PF path. Especially when
the OOM condition might be gone by that time and we have no way to find
out other than to allocate.

Moreover, if the allocation failed and the OOM killer hasn't been
invoked then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restrictions and
therefore might oom kill a task from a different NUMA domain.

An allocation might fail also when the current task is the oom victim
and there are no memory reserves left and we should simply bail out
from the #PF rather than invoking out_of_memory.

This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.

[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into
a real disaster.  It was broken a long time ago, most likely since
6a1a0d3b625a ("mm: allocate kernel pages to the right memcg").
In upstream the problem will be fixed by removing the outdated kmem limit,
however stable and LTS kernels cannot do that and are still affected.
This patch fixes the problem and should be backported into stable.]

Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg")
Cc: stable@vger.kernel.org

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/oom_kill.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..f98954befafb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (fatal_signal_pending(current))
 		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
 
 SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-22  8:11                                             ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-22  8:11 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	kernel-GEFAQzZX7r8dnm+yROfE0A

From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for
2 different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.

The later is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has been already invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.

Moreover if the allocation failed and the OOM killer hasn't been
invoked then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restictions and
therefore might oom kill a task from a different NUMA domain.

An allocation might fail also when the current task is the oom victim
and there are no memory reserves left and we should simply bail out
from the #PF rather than invoking out_of_memory.

This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.

[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
This is a local problem related to memcg; however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into
a real disaster.  It was broken a long time ago, most likely since
6a1a0d3b625a ("mm: allocate kernel pages to the right memcg").
In upstream the problem will be fixed by removing the outdated kmem limit,
however stable and LTS kernels cannot do it and are still affected.
This patch fixes the problem and should be backported into stable.]

Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg")
Cc: stable@vger.kernel.org

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/oom_kill.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..f98954befafb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
+	if (fatal_signal_pending(current))
 		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
 
 SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-22  8:11                                             ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-22  8:11 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This resembles the heuristic for the global OOM situation,
where tasks get access to memory reserves. There is no global memory
shortage at the memcg level, so the memcg heuristic is more lenient.

The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process's lifetime.

It has been observed that it is not really hard to trigger these
bypasses and cause a global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation-wise, a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim a chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.

In addition, this patch renames the should_force_charge() helper
to task_is_dying() because now its use is not associated with forced
charging.

Fixes: a636b327f731 ("memcg: avoid unnecessary system-wide-oom-killer")
Cc: stable@vger.kernel.org
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/memcontrol.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..87e41c3cac10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-22  8:55                                               ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-22  8:55 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Fri 22-10-21 11:11:09, Vasily Averin wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
> 
> The latter is quite problematic because allocation paths already trigger
> out_of_memory and the page allocator tries really hard to not fail
> allocations. Anyway, if the OOM killer has been already invoked there
> is no reason to invoke it again from the #PF path. Especially when the
> OOM condition might be gone by that time and we have no way to find out
> other than allocate.
> 
> Moreover if the allocation failed and the OOM killer hasn't been
> invoked then we are unlikely to do the right thing from the #PF context
> because we have already lost the allocation context and restrictions and
> therefore might oom kill a task from a different NUMA domain.
> 
> An allocation might fail also when the current task is the oom victim
> and there are no memory reserves left and we should simply bail out
> from the #PF rather than invoking out_of_memory.
> 
> This all suggests that there is no legitimate reason to trigger
> out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> that no #PF path returns with VM_FAULT_OOM without an allocation, print a
> warning that this is happening before we restart the #PF.

Thanks for reviving my old patch! I still do think this is the right
direction. I would argue that we should split this into two stages as
already mentioned. First, add a bail-out on fatal_signal_pending,
because that is a trivial thing to do and obviously safer. Then we can
remove out_of_memory completely. That should obviously be the right
thing to do, but if it turns out we have missed something we would need
to revert that part.

> [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
> This is a local problem related to memcg; however, it causes unnecessary
> global OOM kills that are repeated over and over again and escalate into
> a real disaster.  It was broken a long time ago, most likely since
> 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg").
> In upstream the problem will be fixed by removing the outdated kmem limit,
> however stable and LTS kernels cannot do it and are still affected.
> This patch fixes the problem and should be backported into stable.]
> 
> Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg")

I am not sure the sha1 is the right one but the issue is there since
kmem accounting has been introduced so it should be around that time.
Maybe do not name the specific commit but stick to something like
"This has been broken since kmem accounting has been introduced for
cgroup v1 (3.8). There was no kmem specific reclaim for the separate
limit so the only way to handle kmem hard limit was to return with
ENOMEM."

> Cc: stable@vger.kernel.org

I am still not sure this really needs a stable backport. There are no
bug reports for years so I strongly suspect people are simply not using
kmem hard limit with cgroup v1.

> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Vasily Averin <vvs@virtuozzo.com>

Btw. you should be adding your s-o-b here.

> ---
>  mm/oom_kill.c | 23 ++++++++++-------------
>  1 file changed, 10 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..f98954befafb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc)
>  }
>  
>  /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
>   */
>  void pagefault_out_of_memory(void)
>  {
> -	struct oom_control oc = {
> -		.zonelist = NULL,
> -		.nodemask = NULL,
> -		.memcg = NULL,
> -		.gfp_mask = 0,
> -		.order = 0,
> -	};
> +	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
>  
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> -	if (!mutex_trylock(&oom_lock))
> +	if (fatal_signal_pending(current))
>  		return;
> -	out_of_memory(&oc);
> -	mutex_unlock(&oom_lock);
> +
> +	if (__ratelimit(&pfoom_rs))
> +		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
>  }
>  
>  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> -- 
> 2.32.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-22  9:10                                               ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-22  9:10 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Fri 22-10-21 11:11:29, Vasily Averin wrote:
> Memory cgroup charging allows killed or exiting tasks to exceed the hard
> limit. It is assumed that the amount of the memory charged by those
> tasks is bound and most of the memory will get released while the task
> is exiting. This resembles the heuristic for the global OOM situation,
> where tasks get access to memory reserves. There is no global memory
> shortage at the memcg level, so the memcg heuristic is more lenient.
> 
> The above assumption is overly optimistic though. E.g. vmalloc can scale
> to really large requests and the heuristic would allow that. We used to
> have an early break in the vmalloc allocator for killed tasks but this
> has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed""). There are likely other similar code paths
> which do not check for fatal signals in an allocation&charge loop.
> Also there are some kernel objects charged to a memcg which are not
> bound to a process's lifetime.
> 
> It has been observed that it is not really hard to trigger these
> bypasses and cause a global OOM situation.
> 
> One potential way to address these runaways would be to limit the amount
> of excess (similar to the global OOM with limited oom reserves). This is
> certainly possible but it is not really clear how much of an excess is
> desirable and still protects from global OOMs as that would have to
> consider the overall memcg configuration.
> 
> This patch is addressing the problem by removing the heuristic
> altogether. Bypass is only allowed for requests which either cannot fail
> or where the failure is not desirable while excess should be still
> limited (e.g. atomic requests). Implementation-wise, a killed or dying
> task fails to charge if it has passed the OOM killer stage. That should
> give all forms of reclaim a chance to restore the limit before the
> failure (ENOMEM) and tell the caller to back off.
> 
> In addition, this patch renames the should_force_charge() helper
> to task_is_dying() because now its use is not associated with forced
> charging.

I would explicitly mention that this depends on pagefault_out_of_memory
not triggering out_of_memory, because otherwise a memcg failure can
unwind to VM_FAULT_OOM and trigger the global OOM killer.

Maybe it would be even better to fold the removal into this patch so
the dependency is more obvious. I will leave that to you.

> Fixes: a636b327f731 ("memcg: avoid unnecessary system-wide-oom-killer")

A Fixes tag would be quite hard to get right here. For example, you
certainly didn't have a practically unbounded vector to go over the
hard limit - like vmalloc. At least not after __GFP_ACCOUNT has been
introduced. So I would just not bother with a Fixes tag at all, rather
than have it raise more questions than answers.

> Cc: stable@vger.kernel.org
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Other than that looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!
> ---
>  mm/memcontrol.c | 27 ++++++++-------------------
>  1 file changed, 8 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..87e41c3cac10 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>  		(current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);
>  
>  unlock:
>  	mutex_unlock(&oom_lock);
> @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	struct page_counter *counter;
>  	enum oom_status oom_status;
>  	unsigned long nr_reclaimed;
> +	bool passed_oom = false;
>  	bool may_swap = true;
>  	bool drained = false;
>  	unsigned long pflags;
> @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>  
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_RETRY_MAYFAIL)
>  		goto nomem;
>  
> -	if (fatal_signal_pending(current))
> -		goto force;
> +	/* Avoid endless loop for tasks bypassed by the oom killer */
> +	if (passed_oom && task_is_dying())
> +		goto nomem;
>  
>  	/*
>  	 * keep retrying as long as the memcg oom killer is able to make
> @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> -	switch (oom_status) {
> -	case OOM_SUCCESS:
> +	if (oom_status == OOM_SUCCESS) {
> +		passed_oom = true;
>  		nr_retries = MAX_RECLAIM_RETRIES;
>  		goto retry;
> -	case OOM_FAILED:
> -		goto force;
> -	default:
> -		goto nomem;
>  	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
> -- 
> 2.32.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg v3 0/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-23 13:18                                                 ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-23 13:18 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. This can be misused to trigger a global OOM from inside a
memcg-limited container. On the other hand, if a memcg charge called from
inside the #PF handler fails, it triggers a global OOM from inside
pagefault_out_of_memory().

To prevent these problems, this patch set:
a) removes the execution of out_of_memory() from pagefault_out_of_memory(),
   because nobody can explain why it is necessary;
b) allows memcg to fail the allocations of dying/killed tasks.

v3: resplit, improved patch descriptions
v2: resplit,
    use old patch from Michal Hocko removing out_of_memory() from
      pagefault_out_of_memory()


Michal Hocko (1):
  mm, oom: do not trigger out_of_memory from the #PF

Vasily Averin (2):
  mm, oom: pagefault_out_of_memory: don't force global OOM for dying
    tasks
  memcg: prohibit unconditional exceeding the limit of dying tasks

 mm/memcontrol.c | 27 ++++++++-------------------
 mm/oom_kill.c   | 23 ++++++++++-------------
 2 files changed, 18 insertions(+), 32 deletions(-)

-- 
2.32.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
@ 2021-10-23 13:19                                                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-23 13:19 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

Any allocation failure during the #PF path will return with VM_FAULT_OOM,
which in turn results in pagefault_out_of_memory(), which in its own turn
executes out_of_memory() and can kill a random task.

An allocation might fail when the current task is the oom victim
and there are no memory reserves left. The OOM killer is already
handled at the page allocator level for the global OOM and at the
charging level for the memcg one. Both have much more information
about the scope of allocation/charge request. This means that
either the OOM killer has been invoked properly and didn't lead
to the allocation success or it has been skipped because it couldn't
have been invoked. In both cases triggering it from here is pointless
and even harmful.

It makes much more sense to let the killed task die rather than to
wake up an eternally hungry oom-killer and send him to choose a fatter
victim for breakfast.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 831340e7ad8b..1deef8c7a71b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (fatal_signal_pending(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-23 13:20                                                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-23 13:20 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

From: Michal Hocko <mhocko@suse.com>

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for
2 different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.

The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has been already invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.

Moreover if the allocation failed and the OOM killer hasn't been
invoked then we are unlikely to do the right thing from the #PF context
because we have already lost the allocation context and restrictions and
therefore might oom kill a task from a different NUMA domain.

This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.

[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into
a real disaster. This has been broken since kmem accounting has been
introduced for cgroup v1 (3.8). There was no kmem specific reclaim
for the separate limit so the only way to handle kmem hard limit
was to return with ENOMEM.
In upstream the problem will be fixed by removing the outdated kmem limit,
however stable and LTS kernels cannot do it and are still affected.
This patch fixes the problem and should be backported into stable/LTS.]

Cc: stable@vger.kernel.org

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
v2: hook with fatal_signal_pending() has beem moved into a separate patch,
    improved patch description, removed "Fixed" mark.
---
 mm/oom_kill.c | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1deef8c7a71b..f98954befafb 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *oc)
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
@@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void)
 	if (fatal_signal_pending(current))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
-		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
 
 SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-23 13:20                                                   ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-23 13:20 UTC (permalink / raw)
  To: Michal Hocko, Johannes Weiner, Vladimir Davydov, Andrew Morton
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This resembles the heuristic for the global OOM situation
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level, so the memcg heuristic is more relaxed.

The above assumption is overly optimistic though. E.g. vmalloc can scale
to really large requests and the heuristic would allow that. We used to
have an early break in the vmalloc allocator for killed tasks but this
has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
the current task is killed""). There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.
Also there are some kernel objects charged to a memcg which are not
bound to a process life time.

It has been observed that it is not really hard to trigger these
bypasses and cause a global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot fail
or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the
failure (ENOMEM) and tell the caller to back off.

In addition, this patch renames the should_force_charge() helper
to task_is_dying() because its use is no longer associated with forced
charging.

This patch depends on pagefault_out_of_memory() to not trigger
out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
and cause a global OOM killer.

Cc: stable@vger.kernel.org
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c | 27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..87e41c3cac10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -239,7 +239,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_ATOMIC)
 		goto force;
 
-	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-23 15:01                                                     ` Tetsuo Handa
  0 siblings, 0 replies; 131+ messages in thread
From: Tetsuo Handa @ 2021-10-23 15:01 UTC (permalink / raw)
  To: Vasily Averin, Michal Hocko
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, cgroups, linux-mm, linux-kernel, kernel,
	Johannes Weiner, Vladimir Davydov, Andrew Morton

On 2021/10/23 22:20, Vasily Averin wrote:
>  /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
>   */

Excuse me for a stupid question. I consider

  if (!mutex_trylock(&oom_lock))
    return;
  out_of_memory(&oc);
  mutex_unlock(&oom_lock);

here as the last resort (safeguard) when neither __alloc_pages_may_oom()
nor mem_cgroup_out_of_memory() can make progress. This patch says

  let the oom killer triggered from the allocation context care
  about the global OOM.

but what if the OOM killer cannot be invoked from the allocation context?
Is there a guarantee that all memory allocations which might result in
VM_FAULT_OOM can invoke the OOM killer?

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
  2021-10-23 15:01                                                     ` Tetsuo Handa
  (?)
@ 2021-10-23 19:15                                                     ` Vasily Averin
  -1 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-23 19:15 UTC (permalink / raw)
  To: Tetsuo Handa, Michal Hocko
  Cc: Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, cgroups, linux-mm, linux-kernel, kernel,
	Johannes Weiner, Vladimir Davydov, Andrew Morton

On 23.10.2021 18:01, Tetsuo Handa wrote:
> On 2021/10/23 22:20, Vasily Averin wrote:
>>  /*
>> - * The pagefault handler calls here because it is out of memory, so kill a
>> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
>> - * killing is already in progress so do nothing.
>> + * The pagefault handler calls here because some allocation has failed. We have
>> + * to take care of the memcg OOM here because this is the only safe context without
>> + * any locks held but let the oom killer triggered from the allocation context care
>> + * about the global OOM.
>>   */
> 
> Excuse me for a stupid question. I consider
> 
>   if (!mutex_trylock(&oom_lock))
>     return;
>   out_of_memory(&oc);
>   mutex_unlock(&oom_lock);
> 
> here as the last resort (safeguard) when neither __alloc_pages_may_oom()
> nor mem_cgroup_out_of_memory() can make progress. This patch says
> 
>   let the oom killer triggered from the allocation context care
>   about the global OOM.
> 
> but what if the OOM killer cannot be invoked from the allocation context?
> Is there a guarantee that all memory allocations which might result in
> VM_FAULT_OOM can invoke the OOM killer?

I don't think this question is stupid, since I asked it myself :)
You can find this discussion here:
https://lkml.org/lkml/2021/10/21/900

Let me quote it here too
:> 1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory()
:> for example it can be caused by incorrect vm fault operations, 
:> (a) which can return this error without calling the allocator at all.
:
:I would argue this to be a bug. How can that particular code tell
:whether the system is OOM and the oom killer is a reasonable measure
:to take?
:
:> (b) or which can provide incorrect gfp flags so the allocator can fail without executing out_of_memory.
:
: I am not sure I can see any sensible scenario where pagefault oom killer
: would be an appropriate fix for that.
:
:> (c) This may happen on stable/LTS kernels when an otherwise successful allocation fails after hitting the limit of the legacy memcg-kmem controller.
:> We'll drop it in upstream kernels, but how should it be handled in old kernels?
:
:Triggering the global oom killer for legacy kmem charge failure is
:clearly wrong. Removing oom killer from #PF would fix that problem.

I would note that (c) is not a theoretical but a real-life problem: in this case the allocation failed
without invoking the OOM killer, and it is precisely this case where executing out_of_memory() from
pagefault_out_of_memory() leads to a disaster.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-25  8:04                                                       ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-25  8:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka,
	Shakeel Butt, Mel Gorman, cgroups, linux-mm, linux-kernel,
	kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton

On Sun 24-10-21 00:01:07, Tetsuo Handa wrote:
> On 2021/10/23 22:20, Vasily Averin wrote:
> >  /*
> > - * The pagefault handler calls here because it is out of memory, so kill a
> > - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> > - * killing is already in progress so do nothing.
> > + * The pagefault handler calls here because some allocation has failed. We have
> > + * to take care of the memcg OOM here because this is the only safe context without
> > + * any locks held but let the oom killer triggered from the allocation context care
> > + * about the global OOM.
> >   */
> 
> Excuse me for a stupid question. I consider
> 
>   if (!mutex_trylock(&oom_lock))
>     return;
>   out_of_memory(&oc);
>   mutex_unlock(&oom_lock);
> 
> here as the last resort (safeguard) when neither __alloc_pages_may_oom()
> nor mem_cgroup_out_of_memory() can make progress. This patch says
> 
>   let the oom killer triggered from the allocation context care
>   about the global OOM.
> 
> but what if the OOM killer cannot be invoked from the allocation context?
> Is there a guarantee that all memory allocations which might result in
> VM_FAULT_OOM can invoke the OOM killer?

I do not think there is any guarantee. This code has meant to be a
safeguard but it turns out to be adding more harm than a safety. There
are several scenarios mentioned in this thread where this would be
counter productive or outright wrong thing to do.

On the other hand it is hard to imagine any legitimate situation where
this would be the right thing to do. Maybe you have something more
specific in mind? What would be legitimate code that relies on OOM handling
out of line (where the details about the allocation scope are lost)?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
@ 2021-10-25  9:27                                                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-25  9:27 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Sat 23-10-21 16:19:28, Vasily Averin wrote:
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory, which in its turn
> executes out_of_memory() and can kill a random task.
> 
> An allocation might fail when the current task is the oom victim
> and there are no memory reserves left. The OOM killer is already
> handled at the page allocator level for the global OOM and at the
> charging level for the memcg one. Both have much more information
> about the scope of allocation/charge request. This means that
> either the OOM killer has been invoked properly and didn't lead
> to the allocation success or it has been skipped because it couldn't
> have been invoked. In both cases triggering it from here is pointless
> and even harmful.
> 
> It makes much more sense to let the killed task die rather than to
> wake up an eternally hungry oom-killer and send him to choose a fatter
> victim for breakfast.
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/oom_kill.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 831340e7ad8b..1deef8c7a71b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
>  
> +	if (fatal_signal_pending(current))
> +		return;
> +
>  	if (!mutex_trylock(&oom_lock))
>  		return;
>  	out_of_memory(&oc);
> -- 
> 2.32.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-25  9:34                                                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-25  9:34 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Sat 23-10-21 16:20:18, Vasily Averin wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Any allocation failure during the #PF path will return with VM_FAULT_OOM
> which in turn results in pagefault_out_of_memory. This can happen for
> 2 different reasons. a) Memcg is out of memory and we rely on
> mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
> normal allocation fails.
> 
> The latter is quite problematic because allocation paths already trigger
> out_of_memory and the page allocator tries really hard to not fail
> allocations. Anyway, if the OOM killer has been already invoked there
> is no reason to invoke it again from the #PF path. Especially when the
> OOM condition might be gone by that time and we have no way to find out
> other than allocate.
> 
> Moreover if the allocation failed and the OOM killer hasn't been
> invoked then we are unlikely to do the right thing from the #PF context
> because we have already lost the allocation context and restrictions and
> therefore might oom kill a task from a different NUMA domain.
> 
> This all suggests that there is no legitimate reason to trigger
> out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
> that no #PF path returns with VM_FAULT_OOM without an allocation attempt,
> print a warning that this is happening before we restart the #PF.
> 
> [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
> This is a local problem related to memcg, however, it causes unnecessary
> global OOM kills that are repeated over and over again and escalate into
> a real disaster. This has been broken since kmem accounting has been
> introduced for cgroup v1 (3.8). There was no kmem specific reclaim
> for the separate limit so the only way to handle kmem hard limit
> was to return with ENOMEM.
> In upstream the problem will be fixed by removing the outdated kmem limit,
> however stable and LTS kernels cannot do it and are still affected.
> This patch fixes the problem and should be backported into stable/LTS.]
> 
> Cc: stable@vger.kernel.org

I would be still careful about backporting to stable trees. At least
wait for a release cycle to catch potential problems before backporting.
The problem with kmem has been documented for quite a long time and we
haven't received a single bug report IIRC, so this is likely not a real
problem people are facing out there.


> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

I think this is the right thing to do. Hopefully we won't find some
tricky code which would depend on this behavior. If this turns out to be
the case then we will clearly learn about it from the kernel message and
we can handle that situation more gracefully.

Maybe we will want to update the changelog to be more specific based on
the review comments but this one should describe the problem quite well
already.
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
> v2: hook with fatal_signal_pending() has been moved into a separate patch,
>     improved patch description, removed "Fixed" mark.
> ---
>  mm/oom_kill.c | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1deef8c7a71b..f98954befafb 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *oc)
>  }
>  
>  /*
> - * The pagefault handler calls here because it is out of memory, so kill a
> - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
> - * killing is already in progress so do nothing.
> + * The pagefault handler calls here because some allocation has failed. We have
> + * to take care of the memcg OOM here because this is the only safe context without
> + * any locks held but let the oom killer triggered from the allocation context care
> + * about the global OOM.
>   */
>  void pagefault_out_of_memory(void)
>  {
> -	struct oom_control oc = {
> -		.zonelist = NULL,
> -		.nodemask = NULL,
> -		.memcg = NULL,
> -		.gfp_mask = 0,
> -		.order = 0,
> -	};
> +	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +				      DEFAULT_RATELIMIT_BURST);
>  
>  	if (mem_cgroup_oom_synchronize(true))
>  		return;
> @@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void)
>  	if (fatal_signal_pending(current))
>  		return;
>  
> -	if (!mutex_trylock(&oom_lock))
> -		return;
> -	out_of_memory(&oc);
> -	mutex_unlock(&oom_lock);
> +	if (__ratelimit(&pfoom_rs))
> +		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
>  }
>  
>  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
> -- 
> 2.32.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-25  9:36                                                     ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-25  9:36 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Johannes Weiner, Vladimir Davydov, Andrew Morton, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel

On Sat 23-10-21 16:20:51, Vasily Averin wrote:
> Memory cgroup charging allows killed or exiting tasks to exceed the hard
> limit. It is assumed that the amount of the memory charged by those
> tasks is bound and most of the memory will get released while the task
> is exiting. This is resembling a heuristic for the global OOM situation
> when tasks get access to memory reserves. There is no global memory
> shortage at the memcg level so the memcg heuristic is more relieved.
> 
> The above assumption is overly optimistic though. E.g. vmalloc can scale
> to really large requests and the heuristic would allow that. We used to
> have an early break in the vmalloc allocator for killed tasks but this
> has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when
> the current task is killed""). There are likely other similar code paths
> which do not check for fatal signals in an allocation&charge loop.
> Also there are some kernel objects charged to a memcg which are not
> bound to a process life time.
> 
> It has been observed that it is not really hard to trigger these
> bypasses and cause global OOM situation.
> 
> One potential way to address these runaways would be to limit the amount
> of excess (similar to the global OOM with limited oom reserves). This is
> certainly possible but it is not really clear how much of an excess is
> desirable and still protects from global OOMs as that would have to
> consider the overall memcg configuration.
> 
> This patch is addressing the problem by removing the heuristic
> altogether. Bypass is only allowed for requests which either cannot fail
> or where the failure is not desirable while excess should be still
> limited (e.g. atomic requests). Implementation wise a killed or dying
> task fails to charge if it has passed the OOM killer stage. That should
> give all forms of reclaim chance to restore the limit before the
> failure (ENOMEM) and tell the caller to back off.
> 
> In addition, this patch renames the should_force_charge() helper
> to task_is_dying() because its use is no longer associated with forced
> charging.
> 
> This patch depends on pagefault_out_of_memory() to not trigger
> out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
> and cause a global OOM killer.
> 
> Cc: stable@vger.kernel.org

My view on stable backport is similar to the previous patch. If we want
to have it there then let's wait for some time to see whether there are
any fallouts as this patch depends on the PF_OOM change.

> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 27 ++++++++-------------------
>  1 file changed, 8 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6da5020a8656..87e41c3cac10 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -239,7 +239,7 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> -static inline bool should_force_charge(void)
> +static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
>  		(current->flags & PF_EXITING);
> @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	ret = task_is_dying() || out_of_memory(&oc);
>  
>  unlock:
>  	mutex_unlock(&oom_lock);
> @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	struct page_counter *counter;
>  	enum oom_status oom_status;
>  	unsigned long nr_reclaimed;
> +	bool passed_oom = false;
>  	bool may_swap = true;
>  	bool drained = false;
>  	unsigned long pflags;
> @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_ATOMIC)
>  		goto force;
>  
> -	/*
> -	 * Unlike in global OOM situations, memcg is not in a physical
> -	 * memory shortage.  Allow dying and OOM-killed tasks to
> -	 * bypass the last charges so that they can exit quickly and
> -	 * free their memory.
> -	 */
> -	if (unlikely(should_force_charge()))
> -		goto force;
> -
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
>  	 * allocate memory. This might exceed the limits temporarily,
> @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (gfp_mask & __GFP_RETRY_MAYFAIL)
>  		goto nomem;
>  
> -	if (fatal_signal_pending(current))
> -		goto force;
> +	/* Avoid endless loop for tasks bypassed by the oom killer */
> +	if (passed_oom && task_is_dying())
> +		goto nomem;
>  
>  	/*
>  	 * keep retrying as long as the memcg oom killer is able to make
> @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 */
>  	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> -	switch (oom_status) {
> -	case OOM_SUCCESS:
> +	if (oom_status == OOM_SUCCESS) {
> +		passed_oom = true;
>  		nr_retries = MAX_RECLAIM_RETRIES;
>  		goto retry;
> -	case OOM_FAILED:
> -		goto force;
> -	default:
> -		goto nomem;
>  	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
> -- 
> 2.32.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-26 13:56                                                         ` Tetsuo Handa
  0 siblings, 0 replies; 131+ messages in thread
From: Tetsuo Handa @ 2021-10-26 13:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka,
	Shakeel Butt, Mel Gorman, cgroups, linux-mm, linux-kernel,
	kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton

On 2021/10/25 17:04, Michal Hocko wrote:
> I do not think there is any guarantee. This code was meant to be a
> safeguard but it turns out to be adding more harm than safety. There
> are several scenarios mentioned in this thread where this would be
> counterproductive or an outright wrong thing to do.

Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) combined with hitting the
legacy kmem charge limit might be an unexpected combination?

> 
> On the other hand it is hard to imagine any legitimate situation where
> this would be a right thing to do. Maybe you have something more
> specific in mind? What would be the legit code to rely on OOM handling
> out of the line (where the details about the allocation scope is lost)?

I don't have a specific scenario, but I feel it might be a chance to
retry killable vmalloc(). Commit b8c8a338f75e ("Revert "vmalloc: back off
when the current task is killed"") was 4.5 years ago, and fuzz testing has
found many bugs triggered by memory allocation fault injection. Thus, I think
the direction is going towards "we can fail memory allocation upon SIGKILL
(rather than worrying about depleting memory reserves and/or escalating to
global OOM killer invocations)". Most memory allocation requests which
allocate memory for a userspace process are willing to give up upon SIGKILL.

Just as you are trying to add NOFS, NOIO and NOFAIL support to vmalloc(), you
could consider KILLABLE support as well. Of course, direct reclaim makes it
difficult to give up immediately upon SIGKILL, but killable allocation still
sounds nice even on a best-effort basis.



* Re: [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF
@ 2021-10-26 14:07                                                           ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-26 14:07 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Vasily Averin, Roman Gushchin, Uladzislau Rezki, Vlastimil Babka,
	Shakeel Butt, Mel Gorman, cgroups, linux-mm, linux-kernel,
	kernel, Johannes Weiner, Vladimir Davydov, Andrew Morton

On Tue 26-10-21 22:56:44, Tetsuo Handa wrote:
> On 2021/10/25 17:04, Michal Hocko wrote:
> > I do not think there is any guarantee. This code was meant to be a
> > safeguard but it turns out to be adding more harm than safety. There
> > are several scenarios mentioned in this thread where this would be
> > counterproductive or an outright wrong thing to do.
> 
> Setting PR_IO_FLUSHER via prctl(PR_SET_IO_FLUSHER) combined with hitting the
> legacy kmem charge limit might be an unexpected combination?

I am not sure I follow, or why PR_SET_IO_FLUSHER should be relevant. But
triggering the global OOM killer on a kmem charge limit failure is
certainly not the right thing to do. Quite the opposite, because this would
effectively be a global DoS as a result of a local memory constraint.
 
> > On the other hand it is hard to imagine any legitimate situation where
> > this would be a right thing to do. Maybe you have something more
> > specific in mind? What would be the legit code to rely on OOM handling
> > out of the line (where the details about the allocation scope is lost)?
> 
> I don't have a specific scenario, but I feel it might be a chance to
> retry killable vmalloc(). Commit b8c8a338f75e ("Revert "vmalloc: back off
> when the current task is killed"") was 4.5 years ago, and fuzz testing has
> found many bugs triggered by memory allocation fault injection. Thus, I think
> the direction is going towards "we can fail memory allocation upon SIGKILL
> (rather than worrying about depleting memory reserves and/or escalating to
> global OOM killer invocations)". Most memory allocation requests which
> allocate memory for a userspace process are willing to give up upon SIGKILL.
> 
> Just as you are trying to add NOFS, NOIO and NOFAIL support to vmalloc(), you
> could consider KILLABLE support as well. Of course, direct reclaim makes it
> difficult to give up immediately upon SIGKILL, but killable allocation still
> sounds nice even on a best-effort basis.

This is all fine but I am not sure how this is related to this patch.
The previous patch already gives up in pagefault_out_of_memory on a
pending fatal signal. So this code is not really reachable.

Also, allowing more allocations to fail doesn't really suggest that we
should trigger the OOM killer from #PF. I would argue that the opposite is
actually the case. Or have I just not understood your concern?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-27 22:36                                                       ` Andrew Morton
  0 siblings, 0 replies; 131+ messages in thread
From: Andrew Morton @ 2021-10-27 22:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel,
	Greg Kroah-Hartman

On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:

> My view on stable backport is similar to the previous patch. If we want
> to have it there then let's wait for some time to see whether there are
> any fallouts as this patch depends on the PF_OOM change.

It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do.
What is the thinking here?

I expect we'd be OK with merging these into 5.16-rc1.  This still gives
another couple of months in -rc to shake out any problems.  But I
suspect the -stable maintainers will merge and release the patches
before they are released in 5.16.

In which case an alternative would be not to mark these patches
cc:stable and to somehow remember to ask the -stable maintainers to
merge them after 5.16 has been on the streets for a suitable period.

Greg, thoughts?


* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-28  7:22                                                         ` Vasily Averin
  0 siblings, 0 replies; 131+ messages in thread
From: Vasily Averin @ 2021-10-28  7:22 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel,
	Greg Kroah-Hartman

On 28.10.2021 01:36, Andrew Morton wrote:
> On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:
> 
>> My view on stable backport is similar to the previous patch. If we want
>> to have it there then let's wait for some time to see whether there are
>> any fallouts as this patch depends on the PF_OOM change.
> 
> It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do.
> What is the thinking here?

My fault, I missed it.
All 3 patches should be backported.
I have already done it for stable kernels since 4.4 and I'm ready to submit them on demand.

> I expect we'd be OK with merging these into 5.16-rc1.  This still gives
> another couple of months in -rc to shake out any problems.  But I
> suspect the -stable maintainers will merge and release the patches
> before they are released in 5.16.
> 
> In which case an alternative would be not to mark these patches
> cc:stable and to somehow remember to ask the -stable maintainers to
> merge them after 5.16 has been on the streets for a suitable period.
> 
> Greg, thoughts?

If you wish I can remind Greg in a month or even after 5.17 release.

Thank you,
	Vasily Averin



* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-29  7:46                                                           ` Greg Kroah-Hartman
  0 siblings, 0 replies; 131+ messages in thread
From: Greg Kroah-Hartman @ 2021-10-29  7:46 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Vladimir Davydov,
	Roman Gushchin, Uladzislau Rezki, Vlastimil Babka, Shakeel Butt,
	Mel Gorman, Tetsuo Handa, cgroups, linux-mm, linux-kernel,
	kernel

On Thu, Oct 28, 2021 at 10:22:56AM +0300, Vasily Averin wrote:
> On 28.10.2021 01:36, Andrew Morton wrote:
> > On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:
> > 
> >> My view on stable backport is similar to the previous patch. If we want
> >> to have it there then let's wait for some time to see whether there are
> >> any fallouts as this patch depends on the PF_OOM change.
> > 
> > It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do.
> > What is the thinking here?
> 
> My fault, I missed it.
> All 3 patches should be backported.
> I have already done it for stable kernels since 4.4 and I'm ready to submit them on demand.
> 
> > I expect we'd be OK with merging these into 5.16-rc1.  This still gives
> > another couple of months in -rc to shake out any problems.  But I
> > suspect the -stable maintainers will merge and release the patches
> > before they are released in 5.16.
> > 
> > In which case an alternative would be not to mark these patches
> > cc:stable and to somehow remember to ask the -stable maintainers to
> > merge them after 5.16 has been on the streets for a suitable period.
> > 
> > Greg, thoughts?
> 
> If you wish I can remind Greg in a month or even after 5.17 release.

Please remind us then, otherwise I will not remember :)

thanks,

greg k-h


* Re: [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-10-29  7:58                                                         ` Michal Hocko
  0 siblings, 0 replies; 131+ messages in thread
From: Michal Hocko @ 2021-10-29  7:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vasily Averin, Johannes Weiner, Vladimir Davydov, Roman Gushchin,
	Uladzislau Rezki, Vlastimil Babka, Shakeel Butt, Mel Gorman,
	Tetsuo Handa, cgroups, linux-mm, linux-kernel, kernel,
	Greg Kroah-Hartman

On Wed 27-10-21 15:36:08, Andrew Morton wrote:
> On Mon, 25 Oct 2021 11:36:41 +0200 Michal Hocko <mhocko@suse.com> wrote:
> 
> > My view on stable backport is similar to the previous patch. If we want
> > to have it there then let's wait for some time to see whether there are
> > any fallouts as this patch depends on the PF_OOM change.
> 
> It's strange that [1/3] doesn't have cc:stable, but [2/3] and [3/3] do.
> What is the thinking here?
> 
> I expect we'd be OK with merging these into 5.16-rc1.  This still gives
> another couple of months in -rc to shake out any problems.  But I
> suspect the -stable maintainers will merge and release the patches
> before they are released in 5.16.
> 
> In which case an alternative would be not to mark these patches
> cc:stable and to somehow remember to ask the -stable maintainers to
> merge them after 5.16 has been on the streets for a suitable period.

My take on stable backports is http://lkml.kernel.org/r/YXZ6FMzJLEz4TA2d@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks
@ 2021-11-12 23:48                         ` kernel test robot
  -1 siblings, 0 replies; 131+ messages in thread
From: kernel test robot @ 2021-11-12 23:48 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6509 bytes --]

Hi Vasily,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v5.15-rc6]
[cannot apply to hnaz-mm/master next-20211112]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Vasily-Averin/memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks/20211020-160835
base:   v5.15-rc6 519d81956ee277b4419c723adfb154603c2565ba
config: x86_64-rhel-8.3-kselftests (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/fd30aeb6287dfe68bcf02c4d9634e342d6743d90
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Vasily-Averin/memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks/20211020-160835
        git checkout fd30aeb6287dfe68bcf02c4d9634e342d6743d90
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   mm/memcontrol.c:4088:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4088:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4088:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4090:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4090:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4090:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4246:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4246:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4246:9: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4340:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4340:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4340:9: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:5966:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:5966:23: sparse:    struct task_struct [noderef] __rcu *
   mm/memcontrol.c:5966:23: sparse:    struct task_struct *
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:752:9: sparse: sparse: context imbalance in 'lock_page_lruvec' - wrong count at exit
   include/linux/memcontrol.h:752:9: sparse: sparse: context imbalance in 'lock_page_lruvec_irq' - wrong count at exit
   include/linux/memcontrol.h:752:9: sparse: sparse: context imbalance in 'lock_page_lruvec_irqsave' - wrong count at exit
>> mm/memcontrol.c:1826:70: sparse: sparse: non size-preserving integer to pointer cast
   mm/memcontrol.c:1987:6: sparse: sparse: context imbalance in 'lock_page_memcg' - wrong count at exit
   mm/memcontrol.c:2036:17: sparse: sparse: context imbalance in '__unlock_page_memcg' - unexpected unlock
   mm/memcontrol.c:5817:28: sparse: sparse: context imbalance in 'mem_cgroup_count_precharge_pte_range' - unexpected unlock
   mm/memcontrol.c:6011:36: sparse: sparse: context imbalance in 'mem_cgroup_move_charge_pte_range' - unexpected unlock

vim +1826 mm/memcontrol.c

  1765	
  1766	static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
  1767	{
  1768		enum oom_status ret;
  1769		bool locked;
  1770	
  1771		if (order > PAGE_ALLOC_COSTLY_ORDER)
  1772			return OOM_SKIPPED;
  1773	
  1774		memcg_memory_event(memcg, MEMCG_OOM);
  1775	
  1776		/*
  1777		 * We are in the middle of the charge context here, so we
  1778		 * don't want to block when potentially sitting on a callstack
  1779		 * that holds all kinds of filesystem and mm locks.
  1780		 *
  1781		 * cgroup1 allows disabling the OOM killer and waiting for outside
  1782		 * handling until the charge can succeed; remember the context and put
  1783		 * the task to sleep at the end of the page fault when all locks are
  1784		 * released.
  1785		 *
  1786		 * On the other hand, in-kernel OOM killer allows for an async victim
  1787		 * memory reclaim (oom_reaper) and that means that we are not solely
  1788		 * relying on the oom victim to make a forward progress and we can
  1789		 * invoke the oom killer here.
  1790		 *
  1791		 * Please note that mem_cgroup_out_of_memory might fail to find a
  1792		 * victim and then we have to bail out from the charge path.
  1793		 */
  1794		if (memcg->oom_kill_disable) {
  1795			if (!current->in_user_fault)
  1796				return OOM_SKIPPED;
  1797			css_get(&memcg->css);
  1798			current->memcg_in_oom = memcg;
  1799			current->memcg_oom_gfp_mask = mask;
  1800			current->memcg_oom_order = order;
  1801	
  1802			return OOM_ASYNC;
  1803		}
  1804	
  1805		mem_cgroup_mark_under_oom(memcg);
  1806	
  1807		locked = mem_cgroup_oom_trylock(memcg);
  1808	
  1809		if (locked)
  1810			mem_cgroup_oom_notify(memcg);
  1811	
  1812		mem_cgroup_unmark_under_oom(memcg);
  1813		if (mem_cgroup_out_of_memory(memcg, mask, order)) {
  1814			ret = OOM_SUCCESS;
  1815		} else {
  1816			ret = OOM_FAILED;
  1817			/*
  1818			 * In some rare cases mem_cgroup_out_of_memory() can return false.
  1819			 * If it was called from #PF it forces handle_mm_fault()
  1820			 * return VM_FAULT_OOM and executes pagefault_out_of_memory().
  1821			 * memcg_in_oom is set here to notify pagefault_out_of_memory()
  1822			 * that it was a memcg-related failure and not allow to run
  1823			 * global OOM.
  1824			 */
  1825			if (current->in_user_fault)
> 1826				current->memcg_in_oom = (struct mem_cgroup *)ret;
  1827		}
  1828		if (locked)
  1829			mem_cgroup_oom_unlock(memcg);
  1830	
  1831		return ret;
  1832	}
  1833	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 42180 bytes --]

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-10-20  8:07                         ` Vasily Averin
                                           ` (2 preceding siblings ...)
  (?)
@ 2021-11-26  4:32                         ` kernel test robot
  -1 siblings, 0 replies; 131+ messages in thread
From: kernel test robot @ 2021-11-26  4:32 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 6535 bytes --]

Hi Vasily,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v5.15-rc6]
[cannot apply to hnaz-mm/master next-20211125]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Vasily-Averin/memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks/20211020-160835
base:    519d81956ee277b4419c723adfb154603c2565ba
config: x86_64-rhel-8.3-kselftests (https://download.01.org/0day-ci/archive/20211126/202111261248.by6Q61po-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/fd30aeb6287dfe68bcf02c4d9634e342d6743d90
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Vasily-Averin/memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks/20211020-160835
        git checkout fd30aeb6287dfe68bcf02c4d9634e342d6743d90
        # save the config file to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add the following tag as appropriate:
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   mm/memcontrol.c:4088:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4088:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4088:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4090:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4090:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4090:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4246:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4246:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4246:9: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4340:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4340:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4340:9: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:5966:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:5966:23: sparse:    struct task_struct [noderef] __rcu *
   mm/memcontrol.c:5966:23: sparse:    struct task_struct *
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:752:9: sparse: sparse: context imbalance in 'lock_page_lruvec' - wrong count at exit
   include/linux/memcontrol.h:752:9: sparse: sparse: context imbalance in 'lock_page_lruvec_irq' - wrong count at exit
   include/linux/memcontrol.h:752:9: sparse: sparse: context imbalance in 'lock_page_lruvec_irqsave' - wrong count at exit
>> mm/memcontrol.c:1826:70: sparse: sparse: non size-preserving integer to pointer cast
   mm/memcontrol.c:1987:6: sparse: sparse: context imbalance in 'lock_page_memcg' - wrong count at exit
   mm/memcontrol.c:2036:17: sparse: sparse: context imbalance in '__unlock_page_memcg' - unexpected unlock
   mm/memcontrol.c:5817:28: sparse: sparse: context imbalance in 'mem_cgroup_count_precharge_pte_range' - unexpected unlock
   mm/memcontrol.c:6011:36: sparse: sparse: context imbalance in 'mem_cgroup_move_charge_pte_range' - unexpected unlock

vim +1826 mm/memcontrol.c

  1765	
  1766	static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
  1767	{
  1768		enum oom_status ret;
  1769		bool locked;
  1770	
  1771		if (order > PAGE_ALLOC_COSTLY_ORDER)
  1772			return OOM_SKIPPED;
  1773	
  1774		memcg_memory_event(memcg, MEMCG_OOM);
  1775	
  1776		/*
  1777		 * We are in the middle of the charge context here, so we
  1778		 * don't want to block when potentially sitting on a callstack
  1779		 * that holds all kinds of filesystem and mm locks.
  1780		 *
  1781		 * cgroup1 allows disabling the OOM killer and waiting for outside
  1782		 * handling until the charge can succeed; remember the context and put
  1783		 * the task to sleep at the end of the page fault when all locks are
  1784		 * released.
  1785		 *
  1786		 * On the other hand, in-kernel OOM killer allows for an async victim
  1787		 * memory reclaim (oom_reaper) and that means that we are not solely
  1788		 * relying on the oom victim to make a forward progress and we can
  1789		 * invoke the oom killer here.
  1790		 *
  1791		 * Please note that mem_cgroup_out_of_memory might fail to find a
  1792		 * victim and then we have to bail out from the charge path.
  1793		 */
  1794		if (memcg->oom_kill_disable) {
  1795			if (!current->in_user_fault)
  1796				return OOM_SKIPPED;
  1797			css_get(&memcg->css);
  1798			current->memcg_in_oom = memcg;
  1799			current->memcg_oom_gfp_mask = mask;
  1800			current->memcg_oom_order = order;
  1801	
  1802			return OOM_ASYNC;
  1803		}
  1804	
  1805		mem_cgroup_mark_under_oom(memcg);
  1806	
  1807		locked = mem_cgroup_oom_trylock(memcg);
  1808	
  1809		if (locked)
  1810			mem_cgroup_oom_notify(memcg);
  1811	
  1812		mem_cgroup_unmark_under_oom(memcg);
  1813		if (mem_cgroup_out_of_memory(memcg, mask, order)) {
  1814			ret = OOM_SUCCESS;
  1815		} else {
  1816			ret = OOM_FAILED;
  1817			/*
  1818			 * In some rare cases mem_cgroup_out_of_memory() can return false.
  1819			 * If it was called from #PF it forces handle_mm_fault()
  1820			 * return VM_FAULT_OOM and executes pagefault_out_of_memory().
  1821			 * memcg_in_oom is set here to notify pagefault_out_of_memory()
  1822			 * that it was a memcg-related failure and not allow to run
  1823			 * global OOM.
  1824			 */
  1825			if (current->in_user_fault)
> 1826				current->memcg_in_oom = (struct mem_cgroup *)ret;
  1827		}
  1828		if (locked)
  1829			mem_cgroup_oom_unlock(memcg);
  1830	
  1831		return ret;
  1832	}
  1833	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 131+ messages in thread

end of thread, other threads:[~2021-11-26  4:32 UTC | newest]

Thread overview: 131+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-18  8:13 [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
2021-10-18  8:13 ` Vasily Averin
2021-10-18  9:04 ` Michal Hocko
2021-10-18  9:04   ` Michal Hocko
2021-10-18 10:05   ` Vasily Averin
2021-10-18 10:05     ` Vasily Averin
2021-10-18 10:12     ` Vasily Averin
2021-10-18 10:12       ` Vasily Averin
2021-10-18 11:53     ` Michal Hocko
2021-10-18 11:53       ` Michal Hocko
     [not found]       ` <27dc0c49-a0d6-875b-49c6-0ef5c0cc3ac8@virtuozzo.com>
2021-10-18 12:27         ` Michal Hocko
2021-10-18 12:27           ` Michal Hocko
2021-10-18 15:07           ` Shakeel Butt
2021-10-18 15:07             ` Shakeel Butt
2021-10-18 16:51             ` Michal Hocko
2021-10-18 16:51               ` Michal Hocko
2021-10-18 17:13               ` Shakeel Butt
2021-10-18 18:52             ` Vasily Averin
2021-10-18 18:52               ` Vasily Averin
2021-10-18 19:18               ` Vasily Averin
2021-10-18 19:18                 ` Vasily Averin
2021-10-19  5:34                 ` Shakeel Butt
2021-10-19  5:34                   ` Shakeel Butt
2021-10-19  5:33               ` Shakeel Butt
2021-10-19  5:33                 ` Shakeel Butt
2021-10-19  6:42                 ` Vasily Averin
2021-10-19  6:42                   ` Vasily Averin
2021-10-19  8:47                   ` Michal Hocko
2021-10-19  8:47                     ` Michal Hocko
2021-10-19  6:30       ` Vasily Averin
2021-10-19  6:30         ` Vasily Averin
2021-10-19  8:49         ` Michal Hocko
2021-10-19  8:49           ` Michal Hocko
2021-10-19 10:30           ` Vasily Averin
2021-10-19 10:30             ` Vasily Averin
2021-10-19 11:54             ` Michal Hocko
2021-10-19 11:54               ` Michal Hocko
2021-10-19 12:04               ` Michal Hocko
2021-10-19 12:04                 ` Michal Hocko
2021-10-19 13:26                 ` Vasily Averin
2021-10-19 13:26                   ` Vasily Averin
2021-10-19 14:13                   ` Michal Hocko
2021-10-19 14:13                     ` Michal Hocko
2021-10-19 14:19                     ` Michal Hocko
2021-10-19 14:19                       ` Michal Hocko
2021-10-19 19:09                     ` Vasily Averin
2021-10-19 19:09                       ` Vasily Averin
2021-10-20  8:07                       ` [PATCH memcg v4] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-10-20  8:07                         ` Vasily Averin
2021-10-20  8:43                         ` Michal Hocko
2021-10-20  8:43                           ` Michal Hocko
2021-10-20 12:11                           ` [PATCH memcg RFC 0/3] " Vasily Averin
2021-10-20 12:11                             ` Vasily Averin
     [not found]                           ` <cover.1634730787.git.vvs@virtuozzo.com>
2021-10-20 12:12                             ` [PATCH memcg 1/3] mm: do not force global OOM from inside " Vasily Averin
2021-10-20 12:12                               ` Vasily Averin
2021-10-20 12:33                               ` Michal Hocko
2021-10-20 12:33                                 ` Michal Hocko
2021-10-20 13:52                                 ` Vasily Averin
2021-10-20 13:52                                   ` Vasily Averin
2021-10-20 12:13                             ` [PATCH memcg 2/3] memcg: remove charge forcing for " Vasily Averin
2021-10-20 12:13                               ` Vasily Averin
2021-10-20 12:41                               ` Michal Hocko
2021-10-20 12:41                                 ` Michal Hocko
2021-10-20 14:21                                 ` Vasily Averin
2021-10-20 14:21                                   ` Vasily Averin
2021-10-20 14:57                                   ` Michal Hocko
2021-10-20 14:57                                     ` Michal Hocko
2021-10-20 15:20                                     ` Tetsuo Handa
2021-10-20 15:20                                       ` Tetsuo Handa
2021-10-21 10:03                                       ` Michal Hocko
2021-10-21 10:03                                         ` Michal Hocko
2021-10-20 12:14                             ` [PATCH memcg 3/3] memcg: handle memcg oom failures Vasily Averin
2021-10-20 12:14                               ` Vasily Averin
2021-10-20 13:02                               ` Michal Hocko
2021-10-20 15:46                                 ` Vasily Averin
2021-10-20 15:46                                   ` Vasily Averin
2021-10-21 11:49                                   ` Michal Hocko
2021-10-21 11:49                                     ` Michal Hocko
2021-10-21 15:05                                     ` Vasily Averin
2021-10-21 15:05                                       ` Vasily Averin
2021-10-21 16:47                                       ` Michal Hocko
2021-10-21 16:47                                         ` Michal Hocko
2021-10-22  8:10                                         ` [PATCH memcg v2 0/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-10-22  8:10                                           ` Vasily Averin
     [not found]                                         ` <cover.1634889066.git.vvs@virtuozzo.com>
2021-10-22  8:11                                           ` [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
2021-10-22  8:11                                             ` Vasily Averin
2021-10-22  8:55                                             ` Michal Hocko
2021-10-22  8:55                                               ` Michal Hocko
2021-10-22  8:11                                           ` [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-10-22  8:11                                             ` Vasily Averin
2021-10-22  9:10                                             ` Michal Hocko
2021-10-22  9:10                                               ` Michal Hocko
2021-10-23 13:18                                               ` [PATCH memcg v3 0/3] " Vasily Averin
2021-10-23 13:18                                                 ` Vasily Averin
     [not found]                                               ` <cover.1634994605.git.vvs@virtuozzo.com>
2021-10-23 13:19                                                 ` [PATCH memcg v3 1/3] mm, oom: pagefault_out_of_memory: don't force global OOM for " Vasily Averin
2021-10-23 13:19                                                   ` Vasily Averin
2021-10-25  9:27                                                   ` Michal Hocko
2021-10-25  9:27                                                     ` Michal Hocko
2021-10-23 13:20                                                 ` [PATCH memcg v3 2/3] mm, oom: do not trigger out_of_memory from the #PF Vasily Averin
2021-10-23 13:20                                                   ` Vasily Averin
2021-10-23 15:01                                                   ` Tetsuo Handa
2021-10-23 15:01                                                     ` Tetsuo Handa
2021-10-23 19:15                                                     ` Vasily Averin
2021-10-25  8:04                                                     ` Michal Hocko
2021-10-25  8:04                                                       ` Michal Hocko
2021-10-26 13:56                                                       ` Tetsuo Handa
2021-10-26 13:56                                                         ` Tetsuo Handa
2021-10-26 14:07                                                         ` Michal Hocko
2021-10-26 14:07                                                           ` Michal Hocko
2021-10-25  9:34                                                   ` Michal Hocko
2021-10-25  9:34                                                     ` Michal Hocko
2021-10-23 13:20                                                 ` [PATCH memcg v3 3/3] memcg: prohibit unconditional exceeding the limit of dying tasks Vasily Averin
2021-10-23 13:20                                                   ` Vasily Averin
2021-10-25  9:36                                                   ` Michal Hocko
2021-10-25  9:36                                                     ` Michal Hocko
2021-10-27 22:36                                                     ` Andrew Morton
2021-10-27 22:36                                                       ` Andrew Morton
2021-10-28  7:22                                                       ` Vasily Averin
2021-10-28  7:22                                                         ` Vasily Averin
2021-10-29  7:46                                                         ` Greg Kroah-Hartman
2021-10-29  7:46                                                           ` Greg Kroah-Hartman
2021-10-29  7:58                                                       ` Michal Hocko
2021-10-29  7:58                                                         ` Michal Hocko
2021-11-12 23:48                         ` [PATCH memcg v4] " kernel test robot
2021-11-26  4:32                         ` kernel test robot
2021-10-21  8:03   ` [PATCH memcg 0/1] false global OOM triggered by memcg-limited task Vasily Averin
2021-10-21  8:03     ` Vasily Averin
2021-10-21 11:49     ` Michal Hocko
2021-10-21 11:49       ` Michal Hocko
2021-10-21 13:24       ` Vasily Averin
2021-10-21 13:24         ` Vasily Averin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.