* mm: Can we bail out p?d_alloc() loops upon SIGKILL?
@ 2019-02-27  3:43 Tetsuo Handa
  2019-02-27  9:21 ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2019-02-27  3:43 UTC (permalink / raw)
  To: linux-mm

I noticed that when a kdump kernel triggers the OOM killer because too
small a value was given to the crashkernel= parameter, the OOM reaper
tends to fail to reclaim memory from OOM victims because they are in
dup_mm() from copy_mm() from copy_process() with mmap_sem held for write.
A debug dump showed that the OOM victim was merely sleeping at
might_sleep_if() in prepare_alloc_pages() from __alloc_pages_nodemask(),
even though the OOM victim was ready to bail out.

Since copy_page_range() can be called with mmap_sem held for write, it is
not a good thing to continue the loop when killed by the OOM killer.

[    9.965654] systemd-udevd invoked oom-killer: gfp_mask=0x7080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
[    9.968941] CPU: 0 PID: 132 Comm: systemd-udevd Not tainted 5.0.0-rc8+ #838
[    9.970801] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[    9.973897] Call Trace:
[    9.974735]  dump_stack+0x86/0xca
[    9.975693]  dump_header+0x10a/0x9d0
[    9.976746]  ? ___ratelimit+0x1d1/0x3c5
[    9.977838]  oom_kill_process.cold.31+0xb/0x59f
[    9.979078]  ? check_flags.part.40+0x420/0x420
[    9.980727]  out_of_memory+0x287/0x800
[    9.981907]  ? oom_killer_disable+0x200/0x200
[    9.983067]  ? mutex_trylock+0x191/0x1e0
[    9.984183]  ? __alloc_pages_slowpath+0xa16/0x2380
[    9.985485]  __alloc_pages_slowpath+0x1cb2/0x2380
[    9.986767]  ? __zone_watermark_ok+0x213/0x370
[    9.988014]  ? warn_alloc+0x120/0x120
[    9.989089]  ? sched_clock_cpu+0x1b/0x170
[    9.990343]  ? __might_sleep+0x95/0x190
[    9.991569]  __alloc_pages_nodemask+0x515/0x610
[    9.992843]  ? __kasan_kmalloc.constprop.8+0xc5/0xd0
[    9.994215]  ? kasan_slab_alloc+0x11/0x20
[    9.995323]  ? __alloc_pages_slowpath+0x2380/0x2380
[    9.996649]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[    9.998168]  ? _raw_spin_unlock+0x22/0x30
[    9.999240]  __get_free_pages+0x14/0x90
[   10.000303]  get_zeroed_page+0x11/0x20
[   10.001391]  __pud_alloc+0x2e/0x120
[   10.002443]  copy_page_range+0xf78/0x1af0
[   10.003544]  ? sched_clock_cpu+0x1b/0x170
[   10.004658]  ? sched_clock+0x9/0x10
[   10.005646]  ? find_held_lock+0x40/0x1e0
[   10.006909]  ? check_flags.part.40+0x420/0x420
[   10.008450]  ? vma_gap_callbacks_rotate+0x5a/0x90
[   10.009766]  ? __pmd_alloc+0x370/0x370
[   10.010838]  ? __vma_link_rb+0x1fc/0x340
[   10.011963]  copy_process.part.56+0x2f0e/0x6c80
[   10.013184]  ? __cleanup_sighand+0x40/0x40
[   10.014331]  ? sched_clock_cpu+0x1b/0x170
[   10.015398]  ? find_held_lock+0x40/0x1e0
[   10.016489]  ? check_flags.part.40+0x420/0x420
[   10.017747]  _do_fork+0x15d/0xb90
[   10.018677]  ? __fd_install+0x16c/0x470
[   10.019760]  ? fork_idle+0x250/0x250
[   10.020777]  ? fd_install+0x47/0x60
[   10.021766]  ? do_pipe2+0x102/0x140
[   10.022793]  ? pci_mmcfg_check_reserved+0x120/0x120
[   10.024377]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   10.025813]  ? do_syscall_64+0x18/0x3e0
[   10.027035]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   10.029358]  __x64_sys_clone+0xba/0x140
[   10.030779]  do_syscall_64+0x8f/0x3e0
[   10.031848]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   10.033300] RIP: 0033:0x7f674d010f42
[   10.034318] Code: f7 d8 64 89 04 25 d4 02 00 00 64 4c 8b 04 25 10 00 00 00 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 5d 01 00 00 85 c0 41 89 c5 0f 85 67 01 00
[   10.039645] RSP: 002b:00007ffcf9331600 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[   10.041806] RAX: ffffffffffffffda RBX: 00007ffcf9331600 RCX: 00007f674d010f42
[   10.043812] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[   10.045766] RBP: 00007ffcf9331640 R08: 00007f674e3ef8c0 R09: 0000000000000084
[   10.047728] R10: 00007f674e3efb90 R11: 0000000000000246 R12: 0000000000000000
[   10.049685] R13: 0000000000000000 R14: 00007ffcf9333d20 R15: 00007ffcf9333920
[   10.051705] Mem-Info:
[   10.052349] active_anon:3104 inactive_anon:7316 isolated_anon:0
[   10.052349]  active_file:0 inactive_file:0 isolated_file:0
[   10.052349]  unevictable:0 dirty:0 writeback:0 unstable:0
[   10.052349]  slab_reclaimable:5033 slab_unreclaimable:13704
[   10.052349]  mapped:1177 shmem:9911 pagetables:148 bounce:0
[   10.052349]  free:479 free_pcp:41 free_cma:0
[   10.060924] Node 0 active_anon:12416kB inactive_anon:29264kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:4708kB dirty:0kB writeback:0kB shmem:39644kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
[   10.069022] DMA free:508kB min:2052kB low:2052kB high:2052kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:600kB managed:516kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   10.076308] lowmem_reserve[]: 0 123 123 123
[   10.077655] DMA32 free:1408kB min:1416kB low:1768kB high:2120kB active_anon:12416kB inactive_anon:29252kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:261524kB managed:126532kB mlocked:0kB kernel_stack:2656kB pagetables:592kB bounce:0kB free_pcp:164kB local_pcp:164kB free_cma:0kB
[   10.085864] lowmem_reserve[]: 0 0 0 0
[   10.087035] DMA: 0*4kB 1*8kB (M) 1*16kB (M) 1*32kB (M) 1*64kB (U) 1*128kB (U) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 504kB
[   10.090710] DMA32: 14*4kB (UME) 9*8kB (UME) 14*16kB (UME) 7*32kB (UM) 3*64kB (UME) 1*128kB (M) 2*256kB (ME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1408kB
[   10.094751] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   10.097676] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   10.100979] 9911 total pagecache pages
[   10.102191] 0 pages in swap cache
[   10.103238] Swap cache stats: add 0, delete 0, find 0/0
[   10.104849] Free swap  = 0kB
[   10.105839] Total swap = 0kB
[   10.106846] 65531 pages RAM
[   10.107724] 0 pages HighMem/MovableOnly
[   10.108903] 33769 pages reserved
[   10.109902] 0 pages cma reserved
[   10.110915] Unreclaimable slab info:
[   10.112080] Name                      Used          Total
[   10.113686] fib6_nodes                 0KB          4KB
[   10.115512] RAWv6                     10KB         16KB
[   10.117043] sgpool-128                 8KB         31KB
[   10.118553] sgpool-64                  4KB         31KB
[   10.120101] sgpool-32                  2KB         15KB
[   10.121577] sgpool-16                  1KB          7KB
[   10.123167] sgpool-8                   1KB          7KB
[   10.124725] mqueue_inode_cache          1KB         15KB
[   10.126273] bio-1                      2KB          7KB
[   10.127752] UNIX                      67KB         90KB
[   10.129229] ip_fib_trie                1KB          3KB
[   10.130686] ip_fib_alias               1KB          3KB
[   10.132116] RAW                        3KB         30KB
[   10.133631] UDP                        2KB         30KB
[   10.135063] hugetlbfs_inode_cache          2KB         31KB
[   10.136657] eventpoll_pwq             14KB         23KB
[   10.138054] eventpoll_epi             20KB         31KB
[   10.139521] inotify_inode_mark          2KB          3KB
[   10.141039] request_queue              3KB         31KB
[   10.142472] bio-0                      2KB          7KB
[   10.143905] biovec-max                84KB        101KB
[   10.145381] bio_integrity_payload          1KB          7KB
[   10.146988] dmaengine-unmap-2          0KB          4KB
[   10.148415] audit_buffer               0KB          7KB
[   10.149869] skbuff_head_cache        244KB        311KB
[   10.151264] configfs_dir_cache          1KB          3KB
[   10.152759] fsnotify_mark_connector          2KB          3KB
[   10.154326] task_delay_info           43KB         47KB
[   10.155821] proc_dir_entry           385KB        393KB
[   10.157388] pde_opener                 1KB          7KB
[   10.158846] seq_file                  13KB         38KB
[   10.160273] sigqueue                   0KB          7KB
[   10.161766] shmem_inode_cache       1086KB       1099KB
[   10.163256] kernfs_node_cache      23189KB      23193KB
[   10.164688] mnt_cache                 30KB         31KB
[   10.166166] filp                     281KB        285KB
[   10.167596] names_cache              980KB        994KB
[   10.169095] key_jar                    3KB          7KB
[   10.170528] nsproxy                    0KB          3KB
[   10.171954] vm_area_struct           483KB        489KB
[   10.173540] mm_struct                 30KB         48KB
[   10.175039] fs_cache                   6KB         15KB
[   10.176538] files_cache               13KB         30KB
[   10.177977] signal_cache             157KB        184KB
[   10.179469] sighand_cache            217KB        252KB
[   10.180919] task_struct              592KB        626KB
[   10.182349] cred_jar                  63KB         78KB
[   10.183772] anon_vma_chain           364KB        368KB
[   10.185231] anon_vma                 121KB        137KB
[   10.186724] pid                       45KB         48KB
[   10.188546] Acpi-Operand            3938KB       4232KB
[   10.190127] Acpi-ParseExt              0KB         15KB
[   10.191627] Acpi-Parse                 0KB         15KB
[   10.193048] Acpi-State                 0KB         15KB
[   10.194670] Acpi-Namespace          3112KB       3127KB
[   10.196245] trace_event_file         241KB        243KB
[   10.197717] ftrace_event_field        553KB        554KB
[   10.199211] pool_workqueue            18KB         30KB
[   10.200701] task_group                 6KB         15KB
[   10.202331] debug_objects_cache       1675KB       1676KB
[   10.203790] page->ptl                121KB        125KB
[   10.205269] kmalloc-8k               116KB        125KB
[   10.206769] kmalloc-4k               660KB       1033KB
[   10.208563] kmalloc-2k              3480KB       3503KB
[   10.210005] kmalloc-1k               506KB        525KB
[   10.211509] kmalloc-512              413KB        493KB
[   10.212940] kmalloc-256             1042KB       1049KB
[   10.214364] kmalloc-192               96KB        103KB
[   10.215800] kmalloc-128              503KB        506KB
[   10.217203] kmalloc-96               257KB        496KB
[   10.218730] kmalloc-64               962KB        995KB
[   10.220155] kmalloc-32              1755KB       1770KB
[   10.221622] kmalloc-16              1597KB       1604KB
[   10.223047] kmalloc-8               1370KB       1392KB
[   10.224774] kmem_cache_node           91KB         94KB
[   10.226223] kmem_cache               142KB        149KB
[   10.227648] Tasks state (memory values in pages):
[   10.228981] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[   10.231399] [    128]     0   128     8930      954   114688        0         -1000 systemd-udevd
[   10.233952] [    130]     0   130     8765      523   110592        0             0 systemd-udevd
[   10.236312] [    132]     0   132     8765      524   110592        0             0 systemd-udevd
[   10.238702] [    180]     0   180     1162       75    45056        0             0 systemd-detect-
[   10.241295] [    181]     0   181     7725        0   110592        0             0 systemd-journal
[   10.243763] [    185]     0   185     2400        0    81920        0             0 dracut-initqueu
[   10.246177] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=systemd-udevd,pid=132,uid=0
[   10.249123] Out of memory: Kill process 132 (systemd-udevd) score 17 or sacrifice child
[   10.251446] Killed process 132 (systemd-udevd) total-vm:35060kB, anon-rss:400kB, file-rss:4kB, shmem-rss:1692kB
[   11.295270] oom_reaper: unable to reap pid:132 (systemd-udevd)
[   11.296965]   task                        PC stack   pid father
(...snipped...)
[   12.965253] systemd-udevd   R  running task    27168   132    128 0x80100004
[   12.967313] Call Trace:
[   12.968074]  __schedule+0x6c0/0x1a00
[   12.969115]  ? __lock_is_held+0xbc/0x140
[   12.970270]  ? pci_mmcfg_check_reserved+0x120/0x120
[   12.971690]  preempt_schedule_common+0x22/0x60
[   12.973055]  _cond_resched+0x1d/0x30
[   12.974087]  __alloc_pages_nodemask+0x3bd/0x610
[   12.975386]  ? __alloc_pages_slowpath+0x2380/0x2380
[   12.976801]  ? kasan_check_read+0x11/0x20
[   12.978054]  __pmd_alloc+0x36/0x370
[   12.979037]  ? __pud_alloc+0x83/0x120
[   12.980073]  copy_page_range+0x1024/0x1af0
[   12.981183]  ? sched_clock_cpu+0x1b/0x170
[   12.982272]  ? sched_clock+0x9/0x10
[   12.983266]  ? find_held_lock+0x40/0x1e0
[   12.984474]  ? check_flags.part.40+0x420/0x420
[   12.985709]  ? vma_gap_callbacks_rotate+0x5a/0x90
[   12.987059]  ? __pmd_alloc+0x370/0x370
[   12.988112]  ? __vma_link_rb+0x1fc/0x340
[   12.989283]  copy_process.part.56+0x2f0e/0x6c80
[   12.990617]  ? __cleanup_sighand+0x40/0x40
[   12.991724]  ? sched_clock_cpu+0x1b/0x170
[   12.992839]  ? find_held_lock+0x40/0x1e0
[   12.993919]  ? check_flags.part.40+0x420/0x420
[   12.995193]  _do_fork+0x15d/0xb90
[   12.996176]  ? __fd_install+0x16c/0x470
[   12.997216]  ? fork_idle+0x250/0x250
[   12.998252]  ? fd_install+0x47/0x60
[   12.999399]  ? do_pipe2+0x102/0x140
[   13.000389]  ? pci_mmcfg_check_reserved+0x120/0x120
[   13.001735]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   13.003004]  ? do_syscall_64+0x18/0x3e0
[   13.004050]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.005490]  __x64_sys_clone+0xba/0x140
[   13.006786]  do_syscall_64+0x8f/0x3e0
[   13.007781]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   13.009242] RIP: 0033:0x7f674d010f42
[   13.010293] Code: f7 d8 64 89 04 25 d4 02 00 00 64 4c 8b 04 25 10 00 00 00 31 d2 4d 8d 90 d0 02 00 00 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 5d 01 00 00 85 c0 41 89 c5 0f 85 67 01 00
[   13.015179] RSP: 002b:00007ffcf9331600 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[   13.017273] RAX: ffffffffffffffda RBX: 00007ffcf9331600 RCX: 00007f674d010f42
[   13.019208] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[   13.021216] RBP: 00007ffcf9331640 R08: 00007f674e3ef8c0 R09: 0000000000000084
[   13.023186] R10: 00007f674e3efb90 R11: 0000000000000246 R12: 0000000000000000
[   13.025061] R13: 0000000000000000 R14: 00007ffcf9333d20 R15: 00007ffcf9333920
(...snipped...)
[   13.249697] Showing all locks held in the system:
[   13.251378] 1 lock held by oom_reaper/18:
[   13.252499]  #0: 00000000c8a61e24 (rcu_read_lock){....}, at: debug_show_all_locks+0x5b/0x27e
[   13.254906] 1 lock held by systemd-udevd/128:
[   13.256071]  #0: 00000000e09c1ed1 (&mm->mmap_sem){++++}, at: __do_page_fault+0x23a/0x900
[   13.258336] 2 locks held by systemd-udevd/132:
[   13.259559]  #0: 00000000b4432d13 (&mm->mmap_sem){++++}, at: copy_process.part.56+0x23e5/0x6c80
[   13.261868]  #1: 0000000084913324 (&mm->mmap_sem/1){+.+.}, at: copy_process.part.56+0x2408/0x6c80
[   13.264306] 1 lock held by systemd-detect-/180:
[   13.265584]  #0: 000000001cfadba8 (&mm->mmap_sem){++++}, at: __do_page_fault+0x23a/0x900
[   13.267773] 2 locks held by systemd-journal/189:
[   13.269005]  #0: 000000003687636a (&p->lock){+.+.}, at: seq_read+0x66/0x1030
[   13.270934]  #1: 00000000a4d62cb5 (&mm->mmap_sem){++++}, at: __do_page_fault+0x23a/0x900
[   13.273155] 2 locks held by systemctl/190:
[   13.274390]  #0: 000000000f41a6cc (&p->lock){+.+.}, at: seq_read+0x66/0x1030
[   13.276349]  #1: 00000000a1ed5f2f (&mm->mmap_sem){++++}, at: __do_page_fault+0x23a/0x900
[   13.278527] 
[   13.278976] =============================================
[   13.278976] 



* Re: mm: Can we bail out p?d_alloc() loops upon SIGKILL?
  2019-02-27  3:43 mm: Can we bail out p?d_alloc() loops upon SIGKILL? Tetsuo Handa
@ 2019-02-27  9:21 ` Michal Hocko
  2019-02-27 10:39   ` Tetsuo Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2019-02-27  9:21 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Wed 27-02-19 12:43:51, Tetsuo Handa wrote:
> I noticed that when a kdump kernel triggers the OOM killer because too
> small a value was given to the crashkernel= parameter, the OOM reaper
> tends to fail to reclaim memory from OOM victims because they are in
> dup_mm() from copy_mm() from copy_process() with mmap_sem held for write.

I would presume that a page table allocation would fail for the oom
victim as soon as the oom memory reserves get depleted and then
copy_page_range would bail out and release the lock. That being
said, the oom_reaper might bail out before then but does sprinkling
fatal_signal_pending checks into copy_*_range really help reliably?
-- 
Michal Hocko
SUSE Labs



* Re: mm: Can we bail out p?d_alloc() loops upon SIGKILL?
  2019-02-27  9:21 ` Michal Hocko
@ 2019-02-27 10:39   ` Tetsuo Handa
  2019-02-28  9:26     ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2019-02-27 10:39 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

On 2019/02/27 18:21, Michal Hocko wrote:
> On Wed 27-02-19 12:43:51, Tetsuo Handa wrote:
>> I noticed that when a kdump kernel triggers the OOM killer because too
>> small a value was given to the crashkernel= parameter, the OOM reaper
>> tends to fail to reclaim memory from OOM victims because they are in
>> dup_mm() from copy_mm() from copy_process() with mmap_sem held for write.
> 
> I would presume that a page table allocation would fail for the oom
> victim as soon as the oom memory reserves get depleted and then
> copy_page_range would bail out and release the lock. That being
> said, the oom_reaper might bail out before then but does sprinkling
> fatal_signal_pending checks into copy_*_range really help reliably?
> 

Yes, I think so. The OOM victim was just sleeping at might_sleep_if()
rather than continuing allocations until an ALLOC_OOM allocation fails.
Maybe the fact that the kdump kernel enables only one CPU contributed to
the OOM reaper giving up before an ALLOC_OOM allocation failed. But if an
OOM victim in a normal kernel had a huge memory mapping for which
p?d_alloc() is called many times, and the kernel frequently prevented the
OOM victim from continuing ALLOC_OOM allocations, hitting this problem
might not be rare (I don't have a huge machine for testing an intensive
p?d_alloc() loop).

Technically, it would be possible to use a per task_struct flag
which allows __alloc_pages_nodemask() to check early and bail out:

  down_write(&current->mm->mmap_sem);
  current->no_oom_alloc = 1;
  while (...) {
      p?d_alloc();
  }
  current->no_oom_alloc = 0;
  up_write(&current->mm->mmap_sem);



* Re: mm: Can we bail out p?d_alloc() loops upon SIGKILL?
  2019-02-27 10:39   ` Tetsuo Handa
@ 2019-02-28  9:26     ` Michal Hocko
  2019-03-01 10:30       ` Tetsuo Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2019-02-28  9:26 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Wed 27-02-19 19:39:19, Tetsuo Handa wrote:
> On 2019/02/27 18:21, Michal Hocko wrote:
> > On Wed 27-02-19 12:43:51, Tetsuo Handa wrote:
> >> I noticed that when a kdump kernel triggers the OOM killer because too
> >> small a value was given to the crashkernel= parameter, the OOM reaper
> >> tends to fail to reclaim memory from OOM victims because they are in
> >> dup_mm() from copy_mm() from copy_process() with mmap_sem held for write.
> > 
> > I would presume that a page table allocation would fail for the oom
> > victim as soon as the oom memory reserves get depleted and then
> > copy_page_range would bail out and release the lock. That being
> > said, the oom_reaper might bail out before then but does sprinkling
> > fatal_signal_pending checks into copy_*_range really help reliably?
> > 
> 
> Yes, I think so. The OOM victim was just sleeping at might_sleep_if()
> rather than continuing allocations until an ALLOC_OOM allocation fails.
> Maybe the fact that the kdump kernel enables only one CPU contributed to
> the OOM reaper giving up before an ALLOC_OOM allocation failed. But if an
> OOM victim in a normal kernel had a huge memory mapping for which
> p?d_alloc() is called many times, and the kernel frequently prevented the
> OOM victim from continuing ALLOC_OOM allocations, hitting this problem
> might not be rare (I don't have a huge machine for testing an intensive
> p?d_alloc() loop).

We cannot do anything about the preemption so that is moot. ALLOC_OOM
reserve is limited so the failure should happen sooner or later. But
I would be OK to check for fatal_signal_pending once per pmd or so if
that helps and it doesn't add a noticeable overhead.
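
A minimal userspace sketch of the once-per-pmd check discussed above. It is
not kernel code: struct task_model, copy_one_pmd() and
copy_pmd_range_model() are hypothetical names, the sigkill_pending field
merely stands in for fatal_signal_pending(current), and returning -EINTR is
an assumption about what a bail-out might report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task state; not a kernel structure. */
struct task_model {
	bool sigkill_pending;	/* models fatal_signal_pending(current) */
	size_t kill_at;		/* pmd index at which SIGKILL arrives */
};

/* Models one p?d_alloc() plus copying the entries below that pmd. */
static int copy_one_pmd(struct task_model *t, size_t idx)
{
	if (idx == t->kill_at)
		t->sigkill_pending = true;	/* the kill lands during this step */
	return 0;
}

/*
 * Walks nr_pmds entries and checks for a fatal signal once per pmd.
 * Returns 0 on success or -EINTR on bail-out; *copied reports how many
 * pmds were processed before returning.
 */
static int copy_pmd_range_model(struct task_model *t, size_t nr_pmds,
				size_t *copied)
{
	assert(t != NULL && copied != NULL);

	*copied = 0;
	for (size_t i = 0; i < nr_pmds; i++) {
		if (t->sigkill_pending)
			return -EINTR;	/* stop early so the lock can be dropped */

		int ret = copy_one_pmd(t, i);
		if (ret)
			return ret;
		(*copied)++;
	}
	return 0;
}
```

In this model the loop does at most one more pmd of work after the kill
arrives, so the time the lock stays held is bounded by one iteration rather
than by the size of the mapping.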

> Technically, it would be possible to use a per task_struct flag
> which allows __alloc_pages_nodemask() to check early and bail out:
> 
>   down_write(&current->mm->mmap_sem);
>   current->no_oom_alloc = 1;
>   while (...) {
>       p?d_alloc();
>   }
>   current->no_oom_alloc = 0;
>   up_write(&current->mm->mmap_sem);

Looks like a hack to me. We already do have __GFP_NOMEMALLOC,
__GFP_MEMALLOC and PF_MEMALLOC and you want yet another way to control
access to reserves. This is a mess. If anything then PF_NOMEMALLOC would
be a better fit but the flag space is quite tight already. Besides that
is this really worth doing when the caller can bail out?
-- 
Michal Hocko
SUSE Labs



* Re: mm: Can we bail out p?d_alloc() loops upon SIGKILL?
  2019-02-28  9:26     ` Michal Hocko
@ 2019-03-01 10:30       ` Tetsuo Handa
  2019-03-01 11:49         ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2019-03-01 10:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

On 2019/02/28 18:26, Michal Hocko wrote:
> We cannot do anything about the preemption so that is moot. ALLOC_OOM
> reserve is limited so the failure should happen sooner or later. But

The problem is that preemption can slow down ALLOC_OOM allocations (e.g. at
cond_resched() in the direct reclaim path). Since concurrently allocating
threads can consume CPU time, the OOM reaper can give up before the OOM
victim completes (or fails) its ALLOC_OOM allocations.

> I would be OK to check for fatal_signal_pending once per pmd or so if
> that helps and it doesn't add a noticeable overhead.

Another option is to add __GFP_NOMEMALLOC to allocations which might
be made from the fork() path.

> 
>> Technically, it would be possible to use a per task_struct flag
>> which allows __alloc_pages_nodemask() to check early and bail out:
>>
>>   down_write(&current->mm->mmap_sem);
>>   current->no_oom_alloc = 1;
>>   while (...) {
>>       p?d_alloc();
>>   }
>>   current->no_oom_alloc = 0;
>>   up_write(&current->mm->mmap_sem);
> 
> Looks like a hack to me. We already do have __GFP_NOMEMALLOC,
> __GFP_MEMALLOC and PF_MEMALLOC and you want yet another way to control
> access to reserves. This is a mess.

The intention is to fail the allocation as quickly as possible rather than
to avoid consumption of memory reserves. Since the OOM reaper gives up after
just one second, being able to quickly exit the allocation loop and release
mmap_sem held for write is important for allowing the OOM reaper to reclaim
memory from the OOM victim. (I wish __GFP_KILLABLE were there...)
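
To make the one-second budget concrete, here is a small userspace model of
the reaper's retry loop. It is a sketch under assumptions: the 10 attempts
at roughly 100 ms intervals are what mm/oom_kill.c used around v5.0 as far
as I recall, reaper_can_reap() is a hypothetical name, and lock_held_ms
simply models how long the victim keeps mmap_sem held for write after being
killed.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true if the modelled reaper gets the lock within its budget.
 * One attempt is made every interval_ms, starting at time 0, for at
 * most max_retries attempts.
 */
static bool reaper_can_reap(unsigned int lock_held_ms,
			    unsigned int max_retries,
			    unsigned int interval_ms)
{
	unsigned int now_ms = 0;

	assert(interval_ms > 0);

	for (unsigned int attempt = 0; attempt < max_retries; attempt++) {
		if (now_ms >= lock_held_ms)	/* models a successful trylock */
			return true;
		now_ms += interval_ms;		/* models sleeping before retrying */
	}
	return false;	/* "oom_reaper: unable to reap pid:..." */
}
```

With 10 attempts at 100 ms the last try happens at 900 ms, so a victim that
holds the lock for 950 ms is never reaped in this model, while one that
releases it after 300 ms is.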

>                                     If anything then PF_NOMEMALLOC would
> be a better fit but the flag space is quite tight already. Besides that
> is this really worth doing when the caller can bail out?

Scattering __GFP_NOMEMALLOC (as in the draft patch shown below) reduces the
frequency of failing to reclaim memory from the OOM victim. It cannot be
perfect, because the OOM victim might still be blocked at e.g. down_write()
or cond_resched() in __alloc_pages_nodemask(), but could callers using
GFP_KERNEL_ACCOUNT allocations afford __GFP_NOMEMALLOC?
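
For readers unfamiliar with how these flags interact, the following
userspace model sketches the reserve-access decision as I understand it
from __gfp_pfmemalloc_flags() in mm/page_alloc.c around v5.0 (interrupt
context handling is left out). The MODEL_* bit values, the enum and
reserve_access_model() are all hypothetical and chosen for illustration;
they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the kernel's real values differ. */
#define MODEL_GFP_MEMALLOC	0x1u
#define MODEL_GFP_NOMEMALLOC	0x2u

enum reserve_access {
	RESERVE_NONE,		/* normal watermarks apply */
	RESERVE_OOM,		/* models ALLOC_OOM: partial reserve access */
	RESERVE_NO_WATERMARKS	/* models ALLOC_NO_WATERMARKS */
};

static enum reserve_access reserve_access_model(unsigned int gfp,
						bool pf_memalloc,
						bool oom_victim)
{
	/* This model only understands the two bits defined above. */
	assert((gfp & ~(MODEL_GFP_MEMALLOC | MODEL_GFP_NOMEMALLOC)) == 0);

	if (gfp & MODEL_GFP_NOMEMALLOC)
		return RESERVE_NONE;		/* overrides everything below */
	if (gfp & MODEL_GFP_MEMALLOC)
		return RESERVE_NO_WATERMARKS;
	if (pf_memalloc)
		return RESERVE_NO_WATERMARKS;
	if (oom_victim)
		return RESERVE_OOM;
	return RESERVE_NONE;
}
```

In this model an OOM victim normally gets RESERVE_OOM and keeps allocating
from the reserves, whereas the same victim with MODEL_GFP_NOMEMALLOC set is
held to normal watermarks, so its allocation has no reserve to fall back on.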



diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index a281e61..fef88fb 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -102,7 +102,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
 	struct page *page;
-	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOMEMALLOC;
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
@@ -162,7 +162,7 @@ static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t *pu
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	gfp_t gfp = GFP_KERNEL_ACCOUNT;
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_NOMEMALLOC;
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
@@ -202,7 +202,7 @@ static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4
 
 static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	gfp_t gfp = GFP_KERNEL_ACCOUNT;
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_NOMEMALLOC;
 
 	if (mm == &init_mm)
 		gfp &= ~__GFP_ACCOUNT;
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7bd0170..2a36287 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -21,7 +21,7 @@
 #define PGALLOC_USER_GFP 0
 #endif
 
-gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
+gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP | __GFP_NOMEMALLOC;
 
 pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index b69248e..57f0b54 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -338,7 +338,7 @@ struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
 
 struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 {
-	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
+	struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL | __GFP_NOMEMALLOC);
 
 	if (new) {
 		*new = *orig;
diff --git a/mm/memory.c b/mm/memory.c
index e11ca9d..0f27d67 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4574,7 +4574,7 @@ bool ptlock_alloc(struct page *page)
 {
 	spinlock_t *ptl;
 
-	ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
+	ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL | __GFP_NOMEMALLOC);
 	if (!ptl)
 		return false;
 	page->ptl = ptl;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26ea863..d81b0f8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/sched/debug.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -610,6 +611,7 @@ static void oom_reap_task(struct task_struct *tsk)
 
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
+	sched_show_task(tsk);
 	debug_show_all_locks();
 
 done:
diff --git a/mm/rmap.c b/mm/rmap.c
index 0454ecc2..332743c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -270,7 +270,7 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
 		if (unlikely(!avc)) {
 			unlock_anon_vma_root(root);
 			root = NULL;
-			avc = anon_vma_chain_alloc(GFP_KERNEL);
+			avc = anon_vma_chain_alloc(GFP_KERNEL | __GFP_NOMEMALLOC);
 			if (!avc)
 				goto enomem_failure;
 		}
@@ -341,7 +341,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	anon_vma = anon_vma_alloc();
 	if (!anon_vma)
 		goto out_error;
-	avc = anon_vma_chain_alloc(GFP_KERNEL);
+	avc = anon_vma_chain_alloc(GFP_KERNEL | __GFP_NOMEMALLOC);
 	if (!avc)
 		goto out_error_free_anon_vma;
 



* Re: mm: Can we bail out p?d_alloc() loops upon SIGKILL?
  2019-03-01 10:30       ` Tetsuo Handa
@ 2019-03-01 11:49         ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2019-03-01 11:49 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On Fri 01-03-19 19:30:54, Tetsuo Handa wrote:
> On 2019/02/28 18:26, Michal Hocko wrote:
> > We cannot do anything about the preemption so that is moot. ALLOC_OOM
> > reserve is limited so the failure should happen sooner or later. But
> 
> The problem is that preemption can slow down ALLOC_OOM allocations (e.g. at
> cond_resched() in the direct reclaim path). Since concurrently allocating
> threads can consume CPU time, the OOM reaper can give up before the OOM
> victim completes (or fails) its ALLOC_OOM allocations.

But this is an inherent problem and we cannot do anything about it
except for increasing the time the reaper keeps retrying.

> > I would be OK to check for fatal_signal_pending once per pmd or so if
> > that helps and it doesn't add a noticeable overhead.
> 
> Another option is to add __GFP_NOMEMALLOC to allocations which might
> be made from the fork() path.

This is not really maintainable. Page table allocations are used for
other purposes as well, not to mention that each arch would have to do
the same. Why don't you simply try the fatal_signal_pending check per pmd
for starters? Then we can tune the retry count for the oom reaper.
-- 
Michal Hocko
SUSE Labs


