* [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
@ 2019-09-01 20:43 Thomas Lindroth
From: Thomas Lindroth @ 2019-09-01 20:43 UTC
  To: linux-mm; +Cc: stable

After upgrading to the 4.19 series I've started getting problems with
early OOM.

I run a Gentoo system and do large compiles, such as the chromium browser, in
a v1 memory cgroup. When I build chromium in the memory cgroup, the OOM killer
runs and kills programs outside of the cgroup. This happens even when there
is plenty of free memory both inside and outside of the cgroup.

The memory cgroup is named "12G" and is set up like this:
   /sys/fs/cgroup/memory/12G/memory.kmem.limit_in_bytes:1073741824
   /sys/fs/cgroup/memory/12G/memory.kmem.tcp.limit_in_bytes:1073741824
   /sys/fs/cgroup/memory/12G/memory.limit_in_bytes:12884901888
   /sys/fs/cgroup/memory/12G/memory.memsw.limit_in_bytes:12884901888
   /sys/fs/cgroup/memory/12G/memory.soft_limit_in_bytes:9223372036854771712
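
For anyone who does not want to convert these by hand, here is a small Python
sketch that turns the byte values above into GiB. The numbers are copied from
the listing. The `describe` helper and the 4 KiB page size are assumptions
made for this example; the soft limit value equals the largest page-aligned
number that fits in a signed 64-bit integer, which I am reading as "no limit
set".

```python
# Limits copied from the cgroup listing above (bytes).
LIMITS = {
    "memory.kmem.limit_in_bytes": 1073741824,
    "memory.kmem.tcp.limit_in_bytes": 1073741824,
    "memory.limit_in_bytes": 12884901888,
    "memory.memsw.limit_in_bytes": 12884901888,
    "memory.soft_limit_in_bytes": 9223372036854771712,
}

GIB = 1024 ** 3
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as is usual on x86-64
# Largest page-aligned value that fits in a signed 64-bit integer.
UNLIMITED = (2 ** 63 - 1) // PAGE_SIZE * PAGE_SIZE


def describe(value: int) -> str:
    """Return a human-readable form of a byte limit (hypothetical helper)."""
    if value == UNLIMITED:
        return "unlimited"
    return f"{value / GIB:g} GiB"


if __name__ == "__main__":
    for name, value in LIMITS.items():
        print(f"{name}: {describe(value)}")
```

In other words: 1 GiB for kernel memory and for TCP buffers, 12 GiB for memory
and for memory+swap, and no soft limit.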

The system has MemTotal: 16244996 kB

I run the chromium compile job using Gentoo's package manager like this:
cgexec -g memory:12G emerge -1 www-client/chromium

That compile job usually fails and dmesg looks like this:

[ 1084.634827] SLUB: Unable to allocate memory on node -1, gfp=0x6000c0(GFP_KERNEL)
[ 1084.634836]   cache: dentry(100:12G), object size: 192, buffer size: 192, default order: 0, min order: 0
[ 1084.634838]   node 0: slabs: 26888, objs: 564648, free: 0
[ 1084.634857] SLUB: Unable to allocate memory on node -1, gfp=0x6000c0(GFP_KERNEL)
[ 1084.634860]   cache: dentry(100:12G), object size: 192, buffer size: 192, default order: 0, min order: 0
[ 1084.634861]   node 0: slabs: 26888, objs: 564648, free: 0
[ 1084.648583] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.648593]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.648595]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1084.648695] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.648700]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.648702]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1084.648794] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.648799]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.648801]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1084.648898] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.648900]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.648908]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1084.649000] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.649004]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.649006]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1084.649103] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.649107]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.649109]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1084.649198] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.649199]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.649199]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1084.649293] SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
[ 1084.649299]   cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
[ 1084.649299]   node 0: slabs: 19132, objs: 592952, free: 0
[ 1146.798499] Purging GPU memory, 41040 pages freed, 6933 pages still pinned.
[ 1146.798512] 4 and 0 pages still available in the bound and unbound GPU page lists.
[ 1146.798649] Purging GPU memory, 0 pages freed, 6933 pages still pinned.
[ 1146.798653] 4 and 0 pages still available in the bound and unbound GPU page lists.
[ 1146.798696] emerge invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
[ 1146.798699] emerge cpuset=
[ 1146.798701] /
[ 1146.798703]  mems_allowed=0
[ 1146.798705] CPU: 4 PID: 16719 Comm: emerge Not tainted 4.19.69 #43
[ 1146.798707] Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
[ 1146.798708] Call Trace:
[ 1146.798713]  dump_stack+0x46/0x60
[ 1146.798718]  dump_header+0x67/0x28d
[ 1146.798721]  oom_kill_process.cold.31+0xb/0x1f3
[ 1146.798723]  out_of_memory+0x129/0x250
[ 1146.798728]  pagefault_out_of_memory+0x64/0x77
[ 1146.798732]  __do_page_fault+0x3c1/0x3d0
[ 1146.798735]  do_page_fault+0x2c/0x123
[ 1146.798738]  ? page_fault+0x8/0x30
[ 1146.798740]  page_fault+0x1e/0x30
[ 1146.798743] RIP: 0033:0x7f3745ccf628
[ 1146.798744] Code: 43 68 48 8b 40 08 48 89 44 24 08 48 8b 83 00 03 00 00 48 85 c0 0f 84 ff 00 00 00 48 8b 74 24 20 8b 54 24 30 23 93 f8 02 00 00 <48> 8b 14 d0 8b 83 fc 02 00 00 89 f7 c4 e2 fb f7 c6 c4 e2 fb f7 c2
[ 1146.798746] RSP: 002b:00007ffd34c11460 EFLAGS: 00010202
[ 1146.798748] RAX: 000056439d3e5288 RBX: 00007f3745cf0130 RCX: 0000000000000013
[ 1146.798750] RDX: 0000000000000001 RSI: 00000000777f2f73 RDI: 00007f37458c7303
[ 1146.798752] RBP: 0000000000000000 R08: 00007ffd34c11590 R09: 00007f3745cf03f0
[ 1146.798754] R10: 00007f3745c91000 R11: 00007f374557b5c0 R12: 0000000000000000
[ 1146.798756] R13: 0000000000000001 R14: 0000000000000000 R15: 00007f3745c92f58
[ 1146.798758] Mem-Info:
[ 1146.798761] active_anon:438779 inactive_anon:41657 isolated_anon:0
                 active_file:591199 inactive_file:2307684 isolated_file:0
                 unevictable:1 dirty:499 writeback:0 unstable:0
                 slab_reclaimable:299846 slab_unreclaimable:22656
                 mapped:134859 shmem:51982 pagetables:5460 bounce:0
                 free:311407 free_pcp:4594 free_cma:0
[ 1146.798765] Node 0 active_anon:1755116kB inactive_anon:166628kB active_file:2364796kB inactive_file:9230736kB unevictable:4kB isolated(anon):0kB isolated(file):0kB mapped:539436kB dirty:1996kB writeback:0kB shmem:207928kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 598016kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1146.798770] DMA free:15900kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15900kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1146.798772] lowmem_reserve[]:
[ 1146.798773]  0
[ 1146.798774]  3030
[ 1146.798775]  15777
[ 1146.798776]  15777
[ 1146.798780] DMA32 free:1111400kB min:12968kB low:16208kB high:19448kB active_anon:253352kB inactive_anon:1160kB active_file:48104kB inactive_file:1489456kB unevictable:0kB writepending:0kB present:3172224kB managed:3172224kB mlocked:0kB kernel_stack:16kB pagetables:248kB bounce:0kB free_pcp:10952kB local_pcp:1356kB free_cma:0kB
[ 1146.798782] lowmem_reserve[]:
[ 1146.798783]  0
[ 1146.798784]  0
[ 1146.798786]  12746
[ 1146.798787]  12746
[ 1146.798791] Normal free:118328kB min:54544kB low:68180kB high:81816kB active_anon:1501764kB inactive_anon:165596kB active_file:2316692kB inactive_file:7741280kB unevictable:4kB writepending:1996kB present:13367296kB managed:13056872kB mlocked:4kB kernel_stack:9216kB pagetables:21592kB bounce:0kB free_pcp:7412kB local_pcp:1320kB free_cma:0kB
[ 1146.798793] lowmem_reserve[]:
[ 1146.798794]  0
[ 1146.798795]  0
[ 1146.798796]  0
[ 1146.798797]  0
[ 1146.798799] DMA:
[ 1146.798800] 1*4kB
[ 1146.798802] (U)
[ 1146.798803] 1*8kB
[ 1146.798805] (U)
[ 1146.798806] 1*16kB
[ 1146.798807] (U)
[ 1146.798808] 0*32kB
[ 1146.798810] 2*64kB
[ 1146.798811] (U)
[ 1146.798812] 1*128kB
[ 1146.798813] (U)
[ 1146.798815] 1*256kB
[ 1146.798816] (U)
[ 1146.798817] 0*512kB
[ 1146.798818] 1*1024kB
[ 1146.798819] (U)
[ 1146.798821] 1*2048kB
[ 1146.798822] (M)
[ 1146.798823] 3*4096kB
[ 1146.798824] (M)
[ 1146.798826] = 15900kB
[ 1146.798828] DMA32:
[ 1146.798829] 6346*4kB
[ 1146.798831] (UME)
[ 1146.798977] 2650*8kB
[ 1146.798980] (UME)
[ 1146.798981] 1118*16kB
[ 1146.799118] (UME)
[ 1146.799119] 774*32kB
[ 1146.799121] (UME)
[ 1146.799122] 367*64kB
[ 1146.799123] (UME)
[ 1146.799125] 191*128kB
[ 1146.799126] (UME)
[ 1146.799128] 64*256kB
[ 1146.799290] (UME)
[ 1146.799291] 37*512kB
[ 1146.799291] (UM)
[ 1146.799291] 21*1024kB
[ 1146.799292] (UME)
[ 1146.799292] 194*2048kB
[ 1146.799292] (UM)
[ 1146.799292] 127*4096kB
[ 1146.799293] (M)
[ 1146.799293] = 1111512kB
[ 1146.799293] Normal:
[ 1146.799294] 2360*4kB
[ 1146.799294] (UME)
[ 1146.799294] 1483*8kB
[ 1146.799295] (UME)
[ 1146.799295] 506*16kB
[ 1146.799295] (UME)
[ 1146.799295] 377*32kB
[ 1146.799296] (UME)
[ 1146.799296] 171*64kB
[ 1146.799296] (UME)
[ 1146.799296] 59*128kB
[ 1146.799297] (UME)
[ 1146.799297] 46*256kB
[ 1146.799297] (UME)
[ 1146.799297] 7*512kB
[ 1146.799298] (UME)
[ 1146.799298] 4*1024kB
[ 1146.799298] (UE)
[ 1146.799299] 5*2048kB
[ 1146.799299] (ME)
[ 1146.799299] 7*4096kB
[ 1146.799299] (M)
[ 1146.799300] = 118328kB
[ 1146.799301] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1146.799301] 2950857 total pagecache pages
[ 1146.799303] 0 pages in swap cache
[ 1146.799576] Swap cache stats: add 0, delete 0, find 0/0
[ 1146.799578] Free swap  = 4194300kB
[ 1146.799579] Total swap = 4194300kB
[ 1146.799581] 4138876 pages RAM
[ 1146.799582] 0 pages HighMem/MovableOnly
[ 1146.799584] 77627 pages reserved
[ 1146.799585] Tasks state (memory values in pages):
[ 1146.799586] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 1146.799601] [    971]     0   971     3161      967    57344        0             0 systemd-udevd
[ 1146.799605] [   2152]   103  2152      902      692    40960        0             0 dbus-daemon
[ 1146.799607] [   2185]     0  2185     3040      597    57344        0             0 syslog-ng
[ 1146.799609] [   2186]     0  2186    77434     1839    94208        0             0 syslog-ng
[ 1146.799611] [   2187]     0  2187     3299     1286    69632        0             0 syslog_redirect
[ 1146.799798] [   2222]     0  2222    96102     1318   110592        0             0 console-kit-dae
[ 1146.799800] [   2239]   119  2239   486941     5736   237568        0             0 polkitd
[ 1146.799801] [   2259]     0  2259      612       23    45056        0             0 gpm
[ 1146.799802] [   2735]     0  2735      691       66    45056        0             0 dhcpcd
[ 1146.799803] [   2806]   123  2806      983      412    45056        0             0 ntpd
[ 1146.799805] [   2807]   123  2807      999      483    49152        0             0 ntpd
[ 1146.799814] [   2817]     0  2817      945       54    45056        0             0 ntpd
[ 1146.799815] [   2847]     0  2847    18576      902   118784        0             0 virtlogd
[ 1146.799817] [   2878]     0  2878    18577      885   110592        0             0 virtlockd
[ 1146.799822] [   2912]     0  2912   352359     5586   299008        0             0 libvirtd
[ 1146.799823] [   3055]     0  3055     1917      374    53248        0             0 smartd
[ 1146.799824] [   3084]     0  3084     2010      454    53248        0             0 cron
[ 1146.799977] [   3159]     0  3159    78172     1652   106496        0             0 lightdm
[ 1146.799978] [   3178]     0  3178   223859    45920   868352        0             0 X
[ 1146.799984] [   3218]     0  3218    42557     1645    98304        0             0 lightdm
[ 1146.799985] [   3230]  1000  3230     2302      795    61440        0             0 sh
[ 1146.799986] [   3238]  1000  3238     2160      559    57344        0             0 dbus-launch
[ 1146.799992] [   3239]  1000  3239     1023      705    49152        0             0 dbus-daemon
[ 1146.799994] [   3245]  1000  3245    47306     4073   274432        0             0 xfce4-session
[ 1146.799998] [   3247]  1000  3247     3564     1246    73728        0             0 xfconfd
[ 1146.799999] [   3250]  1000  3250     1776       82    49152        0             0 ssh-agent
[ 1146.800000] [   3252]  1000  3252    40853      214    73728        0             0 gpg-agent
[ 1146.800001] [   3255]  1000  3255    37070     6711   294912        0             0 xfwm4
[ 1146.800002] [   3257]  1000  3257    29454     4047   266240        0             0 Thunar
[ 1146.800004] [   3259]  1000  3259    56521     6429   315392        0             0 xfce4-panel
[ 1146.800005] [   3260]  1000  3260    76198     5296   307200        0             0 xfsettingsd
[ 1146.800006] [   3264]  1000  3264   164200    15514   430080        0             0 xfdesktop
[ 1146.800007] [   3269]     0  3269    61046     1602   102400        0             0 upowerd
[ 1146.800008] [   3285]  1000  3285    66179     4569   290816        0             0 panel-2-actions
[ 1146.800009] [   3287]  1000  3287    36585     5112   278528        0             0 panel-8-quickla
[ 1146.800010] [   3288]  1000  3288   633361    62058  1359872        0             0 akregator
[ 1146.800011] [   3290]  1000  3290    71292     7355   294912        0             0 gkrellm
[ 1146.800012] [   3292]  1000  3292    60506     1728   102400        0             0 gvfsd
[ 1146.800013] [   3294]  1000  3294    28673     3873   262144        0             0 panel-6-systray
[ 1146.800014] [   3295]  1000  3295   155110    18850   507904        0             0 konversation
[ 1146.800015] [   3296]  1000  3296    35017     5801   274432        0             0 panel-1-genmon
[ 1146.800016] [   3297]  1000  3297    95199     7413   335872        0             0 panel-4-xfce4-t
[ 1146.800018] [   3298]  1000  3298   118574    21860   516096        0             0 kate
[ 1146.800045] [   3300]  1000  3300    35016     5745   278528        0             0 panel-7-datetim
[ 1146.800046] [   3301]  1000  3301   116227    18944   458752        0             0 konsole
[ 1146.800047] [   3303]  1000  3303   389058    54055   917504        0             0 clementine
[ 1146.800048] [   3372]  1000  3372    77563     1551   102400        0             0 at-spi-bus-laun
[ 1146.800201] [   3381]  1000  3381      865      687    49152        0             0 dbus-daemon
[ 1146.800203] [   3383]  1000  3383    41371     1516    90112        0             0 at-spi2-registr
[ 1146.800205] [   3399]  1000  3399   150830     9120   372736        0             0 kactivitymanage
[ 1146.800212] [   3418]  1000  3418    34971     4175   126976        0             0 clementine-tagr
[ 1146.800214] [   3419]  1000  3419    34971     4210   131072        0             0 clementine-tagr
[ 1146.800218] [   3420]  1000  3420    34971     4198   126976        0             0 clementine-tagr
[ 1146.800219] [   3421]  1000  3421    34971     4193   126976        0             0 clementine-tagr
[ 1146.800220] [   3422]  1000  3422    34971     4166   135168        0             0 clementine-tagr
[ 1146.800222] [   3427]  1000  3427    72028     6718   278528        0             0 kglobalaccel5
[ 1146.800228] [   3428]  1000  3428    34971     4195   126976        0             0 clementine-tagr
[ 1146.800229] [   3429]  1000  3429    34971     4215   131072        0             0 clementine-tagr
[ 1146.800231] [   3431]  1000  3431    34971     4219   126976        0             0 clementine-tagr
[ 1146.800236] [   3444]  1000  3444     6474     5049    98304        0             0 bash
[ 1146.800237] [   3447]  1000  3447    90666    12645   548864        0             0 QtWebEngineProc
[ 1146.800242] [   3452]  1000  3452   149454      569   122880        0             0 mergerfs
[ 1146.800243] [   3487]     0  3487   100615     3753   151552        0             0 udisksd
[ 1146.800244] [   3512]  1000  3512     8823     1162   102400        0             0 sd_cicero
[ 1146.800246] [   3516]  1000  3516     8823     1123    98304        0             0 sd_dummy
[ 1146.800247] [   3519]  1000  3519     8829     1121    94208        0             0 sd_generic
[ 1146.800251] [   3521]  1000  3521    31475     2098   131072        0             0 sd_espeak
[ 1146.800252] [   3523]  1000  3523    74986     7943   311296        0             0 kiod5
[ 1146.800253] [   3546]     0  3546     1992      411    53248        0             0 agetty
[ 1146.800258] [   3547]     0  3547     2025      601    53248        0             0 agetty
[ 1146.800259] [   3548]     0  3548     2025      588    57344        0             0 agetty
[ 1146.800264] [   3549]     0  3549     2025      649    49152        0             0 agetty
[ 1146.800265] [   3550]     0  3550     2025      608    49152        0             0 agetty
[ 1146.800269] [   3551]     0  3551     2025      604    49152        0             0 agetty
[ 1146.800270] [   3559]  1000  3559    39834      554    81920        0             0 speech-dispatch
[ 1146.800274] [   3569]  1000  3569    60094     2084    94208        0             0 gvfs-udisks2-vo
[ 1146.800276] [   3574]  1000  3574    24587     2321   172032        0             0 kdeinit5
[ 1146.800279] [   3575]  1000  3575    76563     9081   331776        0             0 klauncher
[ 1146.800280] [   3596]  1000  3596    74713     9413   311296        0             0 kded5
[ 1146.800281] [   3623]  1000  3623    46085     4836   212992        0             0 kio_http_cache_
[ 1146.800282] [   3653]  1000  3653   138832    11871   372736        0             0 easystroke
[ 1146.800289] [   3654]  1000  3654    83783    12054   348160        0             0 redshift-gtk
[ 1146.800290] [   3655]  1000  3655    41365     7457   294912        0             0 orage
[ 1146.800291] [   3661]  1000  3661    60416     5023   249856        0             0 polkit-gnome-au
[ 1146.800292] [   3664]  1000  3664    60900     4510   245760        0             0 parcellite
[ 1146.800293] [   3680]  1000  3680    21790     2021   122880        0             0 xbindkeys
[ 1146.800297] [   3695]  1000  3695     3987      390    69632        0             0 redshift
[ 1146.800298] [   3696]  1000  3696     2198      691    57344        0             0 su
[ 1146.800299] [   3701]  1000  3701    57610     9559   380928        0             0 fcitx
[ 1146.800305] [   3706]  1000  3706      934      551    53248        0             0 dbus-daemon
[ 1146.800306] [   3711]  1000  3711     1224       28    49152        0             0 fcitx-dbus-watc
[ 1146.800307] [   3713]  1000  3713    89866     7908   303104        0             0 xfce4-notifyd
[ 1146.800433] [   3719]  1000  3719    27953     3652   196608        0             0 file.so
[ 1146.800435] [   3722]     0  3722     6339     4921    90112        0             0 bash
[ 1146.800436] [   3754]     0  3754     2899      865    65536        0             0 tmux: client
[ 1146.800437] [   3756]     0  3756     3593     1512    65536        0             0 tmux: server
[ 1146.800438] [   3760]     0  3760     1256      492    49152        0             0 tomoyo-queryd
[ 1146.800439] [   3764]     0  3764     6369     4973    90112        0             0 bash
[ 1146.800440] [   3771]  1000  3771     6473     5063    94208        0             0 bash
[ 1146.800441] [   3831]  1000  3831     6474     5057    90112        0             0 bash
[ 1146.800442] [   3847]  1000  3847     2084      226    57344        0             0 dmesg
[ 1146.800444] [   4187]  1000  4187   726438   221563  3211264        0             0 firefox
[ 1146.800459] [  23437]  1000 23437   416243    18201  1122304        0           300 QtWebEngineProc
[ 1146.800460] [  16704]     0 16704    69439    48429   610304        0             0 emerge
[ 1146.800462] [  16719]     0 16719    69439    46469   585728        0             0 emerge
[ 1146.800467] [  16724]     0 16724     1764      316    53248        0             0 rssnotify.pl
[ 1146.800468] Out of memory: Kill process 23437 (QtWebEngineProc) score 303 or sacrifice child
[ 1146.800491] Killed process 23437 (QtWebEngineProc) total-vm:1664972kB, anon-rss:16924kB, file-rss:55892kB, shmem-rss:0kB
[ 1146.803320] oom_reaper: reaped process 23437 (QtWebEngineProc), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB
[ 1146.837984] emerge: vmalloc: allocation failure, allocated 8192 of 20480 bytes, mode:0x7080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), nodemask=(null)
[ 1146.837993] emerge cpuset=
[ 1146.837995] /
[ 1146.837997]  mems_allowed=0
[ 1146.837999] CPU: 4 PID: 16742 Comm: emerge Not tainted 4.19.69 #43
[ 1146.838001] Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
[ 1146.838002] Call Trace:
[ 1146.838008]  dump_stack+0x46/0x60
[ 1146.838013]  warn_alloc.cold.133+0x68/0xe8
[ 1146.838016]  ? __alloc_pages_nodemask+0x1f7/0x290
[ 1146.838019]  __vmalloc_node_range+0x148/0x220
[ 1146.838023]  copy_process+0xaab/0x27b0
[ 1146.838025]  ? _do_fork+0xb2/0x390
[ 1146.838028]  _do_fork+0xb2/0x390
[ 1146.838030]  do_syscall_64+0x59/0x180
[ 1146.838033]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1146.838036] RIP: 0033:0x7f3745793aae
[ 1146.838038] Code: db 0f 85 25 01 00 00 64 4c 8b 0c 25 10 00 00 00 45 31 c0 4d 8d 91 d0 02 00 00 31 d2 31 f6 bf 11 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 a6 00 00 00 41 89 c4 85 c0 0f 85 b3 00 00
[ 1146.838040] RSP: 002b:00007ffd34c12ed0 EFLAGS: 00000246
[ 1146.838041]  ORIG_RAX: 0000000000000038
[ 1146.838045] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3745793aae
[ 1146.838046] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 1146.838048] RBP: 0000000000000000 R08: 0000000000000000 R09: 00007f374557b5c0
[ 1146.838050] R10: 00007f374557b890 R11: 0000000000000246 R12: 00007f3744cb17b0
[ 1146.838051] R13: 00007f3745545ad4 R14: 00007f373a32d168 R15: 00007f3745538050
[ 1146.838053] Mem-Info:
[ 1146.838056] active_anon:439660 inactive_anon:41529 isolated_anon:0
                 active_file:591639 inactive_file:2307377 isolated_file:0
                 unevictable:1 dirty:500 writeback:0 unstable:0
                 slab_reclaimable:299846 slab_unreclaimable:22643
                 mapped:133960 shmem:51818 pagetables:5302 bounce:0
                 free:310992 free_pcp:4130 free_cma:0
[ 1146.838060] Node 0 active_anon:1758640kB inactive_anon:166116kB active_file:2366556kB inactive_file:9229508kB unevictable:4kB isolated(anon):0kB isolated(file):0kB mapped:535840kB dirty:2000kB writeback:0kB shmem:207272kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 598016kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1146.838063] DMA free:15900kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15900kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1146.838064] lowmem_reserve[]:
[ 1146.838065]  0
[ 1146.838067]  3030
[ 1146.838068]  15777
[ 1146.838069]  15777
[ 1146.838071] DMA32 free:1115072kB min:12968kB low:16208kB high:19448kB active_anon:250368kB inactive_anon:416kB active_file:48104kB inactive_file:1489456kB unevictable:0kB writepending:0kB present:3172224kB managed:3172224kB mlocked:0kB kernel_stack:16kB pagetables:228kB bounce:0kB free_pcp:11000kB local_pcp:1348kB free_cma:0kB
[ 1146.838073] lowmem_reserve[]:
[ 1146.838074]  0
[ 1146.838075]  0
[ 1146.838076]  12746
[ 1146.838077]  12746
[ 1146.838080] Normal free:112996kB min:54544kB low:68180kB high:81816kB active_anon:1508272kB inactive_anon:165700kB active_file:2318452kB inactive_file:7740052kB unevictable:4kB writepending:2000kB present:13367296kB managed:13056872kB mlocked:4kB kernel_stack:9220kB pagetables:20980kB bounce:0kB free_pcp:5472kB local_pcp:508kB free_cma:0kB
[ 1146.838081] lowmem_reserve[]:
[ 1146.838082]  0
[ 1146.838083]  0
[ 1146.838085]  0
[ 1146.838086]  0
[ 1146.838088] DMA:
[ 1146.838089] 1*4kB
[ 1146.838090] (U)
[ 1146.838091] 1*8kB
[ 1146.838092] (U)
[ 1146.838093] 1*16kB
[ 1146.838094] (U)
[ 1146.838095] 0*32kB
[ 1146.838097] 2*64kB
[ 1146.838098] (U)
[ 1146.838099] 1*128kB
[ 1146.838100] (U)
[ 1146.838101] 1*256kB
[ 1146.838102] (U)
[ 1146.838103] 0*512kB
[ 1146.838104] 1*1024kB
[ 1146.838106] (U)
[ 1146.838107] 1*2048kB
[ 1146.838279] (M)
[ 1146.838282] 3*4096kB
[ 1146.838284] (M)
[ 1146.838287] = 15900kB
[ 1146.838290] DMA32:
[ 1146.838292] 6524*4kB
[ 1146.838294] (UME)
[ 1146.838295] 2692*8kB
[ 1146.838297] (UME)
[ 1146.838298] 1125*16kB
[ 1146.838300] (UME)
[ 1146.838301] 775*32kB
[ 1146.838303] (UME)
[ 1146.838304] 368*64kB
[ 1146.838306] (UME)
[ 1146.838307] 193*128kB
[ 1146.838308] (UME)
[ 1146.838310] 64*256kB
[ 1146.838311] (UME)
[ 1146.838312] 37*512kB
[ 1146.838313] (UM)
[ 1146.838315] 21*1024kB
[ 1146.838316] (UME)
[ 1146.838318] 195*2048kB
[ 1146.838319] (UM)
[ 1146.838321] 127*4096kB
[ 1146.838322] (M)
[ 1146.838323] = 1115072kB
[ 1146.838324] Normal:
[ 1146.838325] 1083*4kB
[ 1146.838326] (UME)
[ 1146.838327] 1426*8kB
[ 1146.838328] (UM)
[ 1146.838329] 493*16kB
[ 1146.838330] (UE)
[ 1146.838331] 359*32kB
[ 1146.838333] (U)
[ 1146.838334] 161*64kB
[ 1146.838335] (UME)
[ 1146.838336] 54*128kB
[ 1146.838337] (UM)
[ 1146.838339] 43*256kB
[ 1146.838340] (UM)
[ 1146.838341] 9*512kB
[ 1146.838342] (UME)
[ 1146.838343] 5*1024kB
[ 1146.838345] (UME)
[ 1146.838347] 5*2048kB
[ 1146.838348] (ME)
[ 1146.838349] 7*4096kB
[ 1146.838350] (M)
[ 1146.838351] = 111980kB
[ 1146.838353] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1146.838355] 2950826 total pagecache pages
[ 1146.838357] 0 pages in swap cache
[ 1146.838358] Swap cache stats: add 0, delete 0, find 0/0
[ 1146.838360] Free swap  = 4194300kB
[ 1146.838362] Total swap = 4194300kB
[ 1146.838363] 4138876 pages RAM
[ 1146.838364] 0 pages HighMem/MovableOnly
[ 1146.838365] 77627 pages reserved

From the output it looks like there was about 9G of inactive file memory and
no swap in use at all when the OOM killer triggered. The output is messy
because I use netconsole.
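
The 9G figure and the unused swap can be checked against the first Mem-Info
dump. Below is a minimal Python sketch with the relevant log lines pasted in
by hand and trimmed to the fields used. The `kb_field` helper is hypothetical
and written only for this example.

```python
import re

# Excerpts from the first OOM report above, trimmed to the fields used here.
NODE_LINE = (
    "Node 0 active_anon:1755116kB inactive_anon:166628kB "
    "active_file:2364796kB inactive_file:9230736kB"
)
SWAP_LINES = ["Free swap  = 4194300kB", "Total swap = 4194300kB"]

KIB_PER_GIB = 1024 * 1024


def kb_field(line: str, name: str) -> int:
    """Extract '<name>:<N>kB' or '<name> = <N>kB' from a dmesg line."""
    match = re.search(rf"{re.escape(name)}\s*[:=]\s*(\d+)kB", line)
    if match is None:
        raise ValueError(f"{name!r} not found in {line!r}")
    return int(match.group(1))


inactive_file_kb = kb_field(NODE_LINE, "inactive_file")
free_swap_kb = kb_field(SWAP_LINES[0], "Free swap")
total_swap_kb = kb_field(SWAP_LINES[1], "Total swap")

inactive_file_gib = inactive_file_kb / KIB_PER_GIB
swap_used_kb = total_swap_kb - free_swap_kb

if __name__ == "__main__":
    print(f"inactive_file: {inactive_file_gib:.1f} GiB")
    print(f"swap used: {swap_used_kb} kB")
```

That comes out to roughly 8.8 GiB of inactive file pages and 0 kB of swap in
use at the time of the kill.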

Swap is on a zram device set up like this:
   echo 8 > /sys/block/zram0/max_comp_streams
   echo lz4 > /sys/block/zram0/comp_algorithm
   echo 4G > /sys/block/zram0/disksize
   mkswap /dev/zram0
   swapon -v --discard /dev/zram0

Those kernel memory allocation failures can also cause a kernel NULL pointer
dereference. Here is the dmesg output captured over netconsole when that happens:

4,1210,922134743,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1211,922134751,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1212,922134753,-;  node 0: slabs: 18346, objs: 568698, free: 0
4,1213,922134762,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1214,922134764,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1215,922134765,-;  node 0: slabs: 18346, objs: 568698, free: 0
4,1216,922134770,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1217,922134771,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1218,922134773,-;  node 0: slabs: 18346, objs: 568698, free: 0
4,1219,922134776,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1220,922134778,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1221,922134779,-;  node 0: slabs: 18346, objs: 568698, free: 0
4,1222,922134784,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1223,922134872,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1224,922134874,-;  node 0: slabs: 18346, objs: 568698, free: 0
4,1225,922135143,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1226,922135147,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1227,922135148,-;  node 0: slabs: 18351, objs: 568713, free: 0
4,1228,922135152,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1229,922135154,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1230,922135162,-;  node 0: slabs: 18351, objs: 568713, free: 0
4,1231,922135165,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1232,922135166,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1233,922135166,-;  node 0: slabs: 18351, objs: 568713, free: 0
4,1234,922135168,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1235,922135175,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1236,922135176,-;  node 0: slabs: 18351, objs: 568713, free: 0
4,1237,922135181,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1238,922135183,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1239,922135183,-;  node 0: slabs: 18351, objs: 568713, free: 0
1,1240,922137835,-;BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
6,1241,922137839,-;PGD 0
4,1242,922137840,c;P4D 0
4,1243,922137842,-;Oops: 0000 [#1] PREEMPT SMP PTI
4,1244,922137844,-;CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
4,1245,922137846,-;Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
4,1246,922137850,-;RIP: 0010:create_empty_buffers+0x24/0x100
4,1247,922137852,-;Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
4,1248,922137853,-;RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
4,1249,922137855,-;RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
4,1250,922137856,-;RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
4,1251,922137857,-;RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
4,1252,922137859,-;R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
4,1253,922137860,-;R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
4,1254,922137861,-;FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
4,1255,922137863,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1256,922137865,-;CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
4,1257,922137866,-;Call Trace:
4,1258,922137967,-; create_page_buffers+0x4d/0x60
4,1259,922137969,-; __block_write_begin_int+0x8e/0x5a0
4,1260,922137972,-; ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
4,1261,922137975,-; ? jbd2__journal_start+0xd7/0x1f0
4,1262,922137977,-; ext4_da_write_begin+0x112/0x3d0
4,1263,922137980,-; generic_perform_write+0xf1/0x1b0
4,1264,922137983,-; ? file_update_time+0x70/0x140
4,1265,922137985,-; __generic_file_write_iter+0x141/0x1a0
4,1266,922137988,-; ext4_file_write_iter+0xef/0x3b0
4,1267,922137990,-; __vfs_write+0x17e/0x1e0
4,1268,922137992,-; vfs_write+0xa5/0x1a0
4,1269,922137994,-; ksys_write+0x57/0xd0
4,1270,922137997,-; do_syscall_64+0x55/0x160
4,1271,922138000,-; entry_SYSCALL_64_after_hwframe+0x44/0xa9
4,1272,922138003,-;RIP: 0033:0x7fb55ba9c0d8
4,1273,922138004,-;Code: 00 90 48 83 ec 38 64 48 8b 04 25 28 00 00 00 48 89 44 24 28 31 c0 48 8d 05 05 96 0d 00 8b 00 85 c0 75 27 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 60 48 8b 4c 24 28 64 48 33 0c 25 28 00 00 00
4,1274,922138006,-;RSP: 002b:00007fff718c1260 EFLAGS: 00000246
4,1275,922138007,c; ORIG_RAX: 0000000000000001
4,1276,922138124,-;RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb55ba9c0d8
4,1277,922138125,-;RDX: 0000000000001000 RSI: 000056395ab78710 RDI: 0000000000000003
4,1278,922138126,-;RBP: 000056395ab78710 R08: 00007fb55b288bc0 R09: 0000000000000000
4,1279,922138128,-;R10: 0000000000000002 R11: 0000000000000246 R12: 000056395ab69900
4,1280,922138129,-;R13: 0000000000001000 R14: 00007fb55bb6c760 R15: 0000000000001000
4,1281,922138131,-;Modules linked in:
4,1282,922138133,c; 8021q
4,1283,922138134,c; iptable_mangle
4,1284,922138135,c; xt_limit
4,1285,922138136,c; xt_conntrack
4,1286,922138137,c; iptable_filter
4,1287,922138138,c; iptable_nat
4,1288,922138139,c; nf_nat_ipv4
4,1289,922138141,c; nf_nat
4,1290,922138142,c; ip_tables
4,1291,922138143,c; arc4
4,1292,922138144,c; ath9k_htc
4,1293,922138145,c; ath9k_common
4,1294,922138146,c; ath9k_hw
4,1295,922138147,c; ath
4,1296,922138148,c; mac80211
4,1297,922138150,c; kvm_intel
4,1298,922138151,c; cfg80211
4,1299,922138152,c; kvm
4,1300,922138153,c; uas
4,1301,922138154,c; crc32_pclmul
4,1302,922138156,c; usb_storage
4,1303,922138157,c; joydev
4,1304,922138201,c; cdc_acm
4,1305,922138203,-;CR2: 0000000000000008
4,1306,922138270,-;---[ end trace ee8624c121072f8e ]---
4,1307,922138273,-;RIP: 0010:create_empty_buffers+0x24/0x100
4,1308,922138275,-;Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
4,1309,922138278,-;RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
4,1310,922138280,-;RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
4,1311,922138281,-;RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
4,1312,922138282,-;RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
4,1313,922138284,-;R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
4,1314,922138285,-;R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
4,1315,922138286,-;FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
4,1316,922138288,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1317,922138289,-;CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
0,1318,922138290,-;Kernel panic - not syncing: Fatal exception
0,1319,922138296,-;Kernel Offset: 0x6000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
0,1320,922138299,-;---[ end Kernel panic - not syncing: Fatal exception ]---
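
For anyone reading the dump above by hand, here is a rough sketch of how the
records could be split up and the faulting instruction bytes located. It
assumes the lines follow the /dev/kmsg layout "prio,seq,usec,flags;text"; the
names KmsgRecord, parse_kmsg_record and faulting_bytes are made up for this
illustration and are not part of any kernel tool.

```python
import re
from typing import NamedTuple, Optional, Tuple


class KmsgRecord(NamedTuple):
    level: int   # syslog level (facility bits masked off)
    seq: int     # record sequence number
    usec: int    # timestamp in microseconds since boot
    flags: str   # "-" for a whole line, "c" for a line fragment
    text: str    # the message itself


def parse_kmsg_record(line: str) -> Optional[KmsgRecord]:
    """Split one 'prio,seq,usec,flags;text' record; None if it does not match."""
    m = re.match(r"(\d+),(\d+),(\d+),([^,;]*)[^;]*;(.*)", line)
    if m is None:
        return None
    prio, seq, usec, flags, text = m.groups()
    return KmsgRecord(int(prio) & 7, int(seq), int(usec), flags, text)


def faulting_bytes(code_text: str, count: int = 4) -> Tuple[int, str]:
    """Return (index of the <xx> byte, hex of `count` bytes starting there)."""
    tokens = code_text.split(":", 1)[1].split()
    idx = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    picked = [t.strip("<>") for t in tokens[idx:idx + count]]
    return idx, " ".join(picked)


# Shortened copy of the "Code:" record from the oops, for illustration only.
rec = parse_kmsg_record("4,1308,922138275,-;Code: 48 89 ca <48> 8b 4a 08 4c 09 22")
print(rec.level, rec.seq, faulting_bytes(rec.text))
```

On x86-64 the bytes 48 8b 4a 08 decode to "mov 0x8(%rdx),%rcx", which is
consistent with RDX being 0 and CR2 being 0x8 in the register dump.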

Since I never had this problem with 4.14 I tested the old 4.14.115 that I used
before updating to 4.19 and I couldn't reproduce the problem with it. I can
easily reproduce the problem at least back to 4.19.42 in the 4.19 series.
I could bisect the problem but that's going to take forever so I'm hoping I
can avoid that.

The problem only occurs when that "12G" memory cgroup is used to build
chromium; nothing else triggers it. Chromium is the largest package I regularly
build and I selectively enable ccache for the chromium build. My gut feeling
tells me that the massive number of file operations needed for the build is
what is triggering the problem. Perhaps when the memory.kmem.limit_in_bytes
limit is reached?

Here is some more info that I don't think fits in a single mail.

full dmesg https://pastebin.com/raw/tKgTCTJ2
.config https://pastebin.com/raw/jKhSqqCX

Some relevant parts of /etc/sysctl.conf:
   vm.dirty_writeback_centisecs = 3000
   vm.dirty_background_bytes = 52428800
   vm.dirty_bytes = 262144000
   vm.swappiness = 99
   vm.vfs_cache_pressure = 25


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-01 20:43 [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69 Thomas Lindroth
@ 2019-09-02  7:16 ` Michal Hocko
  2019-09-02  7:27   ` Michal Hocko
  2019-09-02 19:34   ` Thomas Lindroth
       [not found] ` <666dbcde-1b8a-9e2d-7d1f-48a117c78ae1@I-love.SAKURA.ne.jp>
       [not found] ` <20190906125608.32129-1-mhocko@kernel.org>
  2 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2019-09-02  7:16 UTC (permalink / raw)
  To: Thomas Lindroth; +Cc: linux-mm, stable

On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
> After upgrading to the 4.19 series I've started getting problems with
> early OOM.

What is the kernel you have updated from? Would it be possible to try
the current Linus' tree?

> I run a Gentoo system and do large compiles like the chromium browser in a
> v1 memory cgroup. When I build chromium in the memory cgroup the OOM killer
> runs and kills programs outside of the cgroup. This happens even when there
> is plenty of free memory both in and outside of the cgroup.
[...]
> [ 1146.798696] emerge invoked oom-killer: gfp_mask=0x0(), nodemask=(null), order=0, oom_score_adj=0
> [ 1146.798699] emerge cpuset=
> [ 1146.798701] /
> [ 1146.798703]  mems_allowed=0
> [ 1146.798705] CPU: 4 PID: 16719 Comm: emerge Not tainted 4.19.69 #43
> [ 1146.798707] Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
> [ 1146.798708] Call Trace:
> [ 1146.798713]  dump_stack+0x46/0x60
> [ 1146.798718]  dump_header+0x67/0x28d
> [ 1146.798721]  oom_kill_process.cold.31+0xb/0x1f3
> [ 1146.798723]  out_of_memory+0x129/0x250
> [ 1146.798728]  pagefault_out_of_memory+0x64/0x77
> [ 1146.798732]  __do_page_fault+0x3c1/0x3d0
> [ 1146.798735]  do_page_fault+0x2c/0x123
> [ 1146.798738]  ? page_fault+0x8/0x30
> [ 1146.798740]  page_fault+0x1e/0x30

This is not the memcg oom killer, and the oom killer itself is a reaction
to the allocation not making forward progress. It smells like
something in the page fault path has returned ENOMEM, leading to
VM_FAULT_OOM. Seeing unexpected SLUB allocation failures would suggest
that something is not really working properly there.
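
One rough way to tell the two cases apart from a call trace is sketched below.
It assumes that a memcg-constrained kill shows mem_cgroup_out_of_memory
somewhere in the trace; that is a heuristic, not something the kernel
guarantees, and the helper names frame_symbols and looks_like_memcg_oom are
invented for this example.

```python
import re

# Matches "symbol+0xoff/0xsize" frames as printed in a kernel call trace.
FRAME = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frame_symbols(trace_lines):
    """Extract function names, dropping compiler suffixes such as .cold.31."""
    symbols = []
    for line in trace_lines:
        m = FRAME.search(line)
        if m:
            symbols.append(m.group(1).split(".")[0])
    return symbols


def looks_like_memcg_oom(trace_lines):
    """Heuristic (assumption): memcg kills go through mem_cgroup_out_of_memory."""
    return "mem_cgroup_out_of_memory" in frame_symbols(trace_lines)


trace = [
    "[ 1146.798721]  oom_kill_process.cold.31+0xb/0x1f3",
    "[ 1146.798723]  out_of_memory+0x129/0x250",
    "[ 1146.798728]  pagefault_out_of_memory+0x64/0x77",
]
print(frame_symbols(trace), looks_like_memcg_oom(trace))
```

For the trace quoted above the check comes out false, which matches the
observation that this was not the memcg oom killer.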
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-02  7:16 ` Michal Hocko
@ 2019-09-02  7:27   ` Michal Hocko
  2019-09-02 19:34   ` Thomas Lindroth
  1 sibling, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2019-09-02  7:27 UTC (permalink / raw)
  To: Thomas Lindroth; +Cc: linux-mm, stable

On Mon 02-09-19 09:16:17, Michal Hocko wrote:
> On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
> > After upgrading to the 4.19 series I've started getting problems with
> > early OOM.
> 
> What is the kernel you have updated from? Would it be possible to try
> the current Linus' tree?

Btw. checking vanilla 4.19 without stable patches might be interesting
as well.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-02  7:16 ` Michal Hocko
  2019-09-02  7:27   ` Michal Hocko
@ 2019-09-02 19:34   ` Thomas Lindroth
  2019-09-03  7:41     ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Lindroth @ 2019-09-02 19:34 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, stable

On 9/2/19 9:16 AM, Michal Hocko wrote:
> On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
>> After upgrading to the 4.19 series I've started getting problems with
>> early OOM.
> 
> What is the kernel you have updated from? Would it be possible to try
> the current Linus' tree?

I did some more testing and it turns out this is not a regression after all.

I followed up on my hunch and monitored memory.kmem.max_usage_in_bytes while
running cgexec -g memory:12G bash -c 'find / -xdev -type f -print0 | \
         xargs -0 -n 1 -P 8 stat > /dev/null'

Just as memory.kmem.max_usage_in_bytes = memory.kmem.limit_in_bytes the OOM
killer kicked in and killed my X server.

Using the find|stat approach it was easy to test the problem in a testing VM.
I was able to reproduce the problem in all these kernels:
   4.9.0
   4.14.0
   4.14.115
   4.19.0
   5.2.11

5.3-rc6 didn't build in the VM. The build environment is too old probably.

I was curious why I initially couldn't reproduce the problem in 4.14 by
building chromium. I was again able to successfully build chromium using
4.14.115. Turns out memory.kmem.max_usage_in_bytes was 1015689216 after
building and my limit is set to 1073741824. I guess some unrelated change in
memory management raised that slightly for 4.19 triggering the problem.
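
To put numbers on how close the 4.14.115 build came to the limit, here is a
small worked calculation. The function name kmem_headroom is just for this
example; the two values are the ones quoted above.

```python
def kmem_headroom(max_usage: int, limit: int):
    """Return (bytes left below the limit, peak usage as a percentage of the limit)."""
    headroom = limit - max_usage
    return headroom, 100.0 * max_usage / limit


headroom, pct = kmem_headroom(1015689216, 1073741824)
print(f"{headroom} bytes ({headroom / 2**20:.1f} MiB) left, peak at {pct:.1f}% of limit")
# prints "58052608 bytes (55.4 MiB) left, peak at 94.6% of limit"
```

So the 4.14.115 build peaked about 55 MiB short of the 1G kmem limit.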

If you want to reproduce this for yourself, here are the steps:
1. Build any kernel above 4.9 using something like my .config
2. Set up a v1 memory cgroup with memory.kmem.limit_in_bytes lower than
    memory.limit_in_bytes. I used 100M in my testing VM.
3. Run "find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
    in the cgroup.
4. Assuming there are enough inodes on the rootfs the global OOM killer
    should kick in when memory.kmem.max_usage_in_bytes =
    memory.kmem.limit_in_bytes and kill something outside the cgroup.
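
While the find|stat job runs, the two counters can be compared with something
like the sketch below. It assumes a cgroup v1 directory such as
/sys/fs/cgroup/memory/12G; read_counter, kmem_limit_reached and
make_fake_cgroup are hypothetical helpers, and the fake directory exists only
so the example can run without a real cgroup.

```python
import os
import tempfile


def read_counter(cgroup_dir: str, name: str) -> int:
    """Read one integer counter file from a cgroup v1 memory directory."""
    with open(os.path.join(cgroup_dir, name)) as f:
        return int(f.read().strip())


def kmem_limit_reached(cgroup_dir: str) -> bool:
    """True once the recorded kmem peak has reached the configured kmem limit."""
    peak = read_counter(cgroup_dir, "memory.kmem.max_usage_in_bytes")
    limit = read_counter(cgroup_dir, "memory.kmem.limit_in_bytes")
    return peak >= limit


def make_fake_cgroup(peak: int, limit: int) -> str:
    """Create a throwaway directory that mimics the two counter files."""
    d = tempfile.mkdtemp()
    for name, value in (("memory.kmem.max_usage_in_bytes", peak),
                        ("memory.kmem.limit_in_bytes", limit)):
        with open(os.path.join(d, name), "w") as f:
            f.write(f"{value}\n")
    return d


# On a real system this would be kmem_limit_reached("/sys/fs/cgroup/memory/12G").
print(kmem_limit_reached(make_fake_cgroup(1015689216, 1073741824)))  # prints False
```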


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-02 19:34   ` Thomas Lindroth
@ 2019-09-03  7:41     ` Michal Hocko
  2019-09-03 12:01       ` Thomas Lindroth
  2019-09-03 12:05       ` Andrey Ryabinin
  0 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2019-09-03  7:41 UTC (permalink / raw)
  To: Thomas Lindroth; +Cc: linux-mm, stable

On Mon 02-09-19 21:34:29, Thomas Lindroth wrote:
> On 9/2/19 9:16 AM, Michal Hocko wrote:
> > On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
> > > After upgrading to the 4.19 series I've started getting problems with
> > > early OOM.
> > 
> > What is the kernel you have updated from? Would it be possible to try
> > the current Linus' tree?
> 
> I did some more testing and it turns out this is not a regression after all.
> 
> I followed up on my hunch and monitored memory.kmem.max_usage_in_bytes while
> running cgexec -g memory:12G bash -c 'find / -xdev -type f -print0 | \
>         xargs -0 -n 1 -P 8 stat > /dev/null'
> 
> Just as memory.kmem.max_usage_in_bytes = memory.kmem.limit_in_bytes the OOM
> killer kicked in and killed my X server.
> 
> Using the find|stat approach it was easy to test the problem in a testing VM.
> I was able to reproduce the problem in all these kernels:
>   4.9.0
>   4.14.0
>   4.14.115
>   4.19.0
>   5.2.11
> 
> 5.3-rc6 didn't build in the VM. The build environment is too old probably.
> 
> I was curious why I initially couldn't reproduce the problem in 4.14 by
> building chromium. I was again able to successfully build chromium using
> 4.14.115. Turns out memory.kmem.max_usage_in_bytes was 1015689216 after
> building and my limit is set to 1073741824. I guess some unrelated change in
> memory management raised that slightly for 4.19 triggering the problem.
> 
> If you want to reproduce this for yourself, here are the steps:
> 1. Build any kernel above 4.9 using something like my .config
> 2. Set up a v1 memory cgroup with memory.kmem.limit_in_bytes lower than
>    memory.limit_in_bytes. I used 100M in my testing VM.
> 3. Run "find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
>    in the cgroup.
> 4. Assuming there are enough inodes on the rootfs the global OOM killer
>    should kick in when memory.kmem.max_usage_in_bytes =
>    memory.kmem.limit_in_bytes and kill something outside the cgroup.

This is certainly a bug. Is this still an OOM triggered from
pagefault_out_of_memory? Since 4.19 (29ef680ae7c21) the memcg oom
killer should be invoked directly from the charge path.
If that doesn't happen then the failing charge is either GFP_NOFS or a
large allocation.

The former has been fixed just recently by http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
and I suspect this is the fix you are looking for. It is curious, though,
that you can see a global oom even before, because the charge path would
mark an oom situation even for NOFS context and that should trigger the
memcg oom killer on the way out from the page fault path. So you should
get essentially the same call trace, except that the oom killer should be
constrained to the memcg context.

Could you try the above patch please?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-03  7:41     ` Michal Hocko
@ 2019-09-03 12:01       ` Thomas Lindroth
  2019-09-03 12:05       ` Andrey Ryabinin
  1 sibling, 0 replies; 26+ messages in thread
From: Thomas Lindroth @ 2019-09-03 12:01 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, stable

On 9/3/19 9:41 AM, Michal Hocko wrote:
> On Mon 02-09-19 21:34:29, Thomas Lindroth wrote:
>> On 9/2/19 9:16 AM, Michal Hocko wrote:
>>> On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
>>>> After upgrading to the 4.19 series I've started getting problems with
>>>> early OOM.
>>>
>>> What is the kernel you have updated from? Would it be possible to try
>>> the current Linus' tree?
>>
>> I did some more testing and it turns out this is not a regression after all.
>>
>> I followed up on my hunch and monitored memory.kmem.max_usage_in_bytes while
>> running cgexec -g memory:12G bash -c 'find / -xdev -type f -print0 | \
>>          xargs -0 -n 1 -P 8 stat > /dev/null'
>>
>> Just as memory.kmem.max_usage_in_bytes = memory.kmem.limit_in_bytes the OOM
>> killer kicked in and killed my X server.
>>
>> Using the find|stat approach it was easy to test the problem in a testing VM.
>> I was able to reproduce the problem in all these kernels:
>>    4.9.0
>>    4.14.0
>>    4.14.115
>>    4.19.0
>>    5.2.11
>>
>> 5.3-rc6 didn't build in the VM. The build environment is too old probably.
>>
>> I was curious why I initially couldn't reproduce the problem in 4.14 by
>> building chromium. I was again able to successfully build chromium using
>> 4.14.115. Turns out memory.kmem.max_usage_in_bytes was 1015689216 after
>> building and my limit is set to 1073741824. I guess some unrelated change in
>> memory management raised that slightly for 4.19 triggering the problem.
>>
>> If you want to reproduce this for yourself, here are the steps:
>> 1. Build any kernel above 4.9 using something like my .config
>> 2. Set up a v1 memory cgroup with memory.kmem.limit_in_bytes lower than
>>     memory.limit_in_bytes. I used 100M in my testing VM.
>> 3. Run "find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
>>     in the cgroup.
>> 4. Assuming there are enough inodes on the rootfs the global OOM killer
>>     should kick in when memory.kmem.max_usage_in_bytes =
>>     memory.kmem.limit_in_bytes and kill something outside the cgroup.
> 
> This is certainly a bug. Is this still an OOM triggered from
> pagefault_out_of_memory? Since 4.19 (29ef680ae7c21) the memcg oom
> killer should be invoked directly from the charge path.
> If that doesn't happen then the failing charge is either GFP_NOFS or a
> large allocation.
> 
> The former has been fixed just recently by http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
> and I suspect this is the fix you are looking for. It is curious, though,
> that you can see a global oom even before, because the charge path would
> mark an oom situation even for NOFS context and that should trigger the
> memcg oom killer on the way out from the page fault path. So you should
> get essentially the same call trace, except that the oom killer should be
> constrained to the memcg context.
> 
> Could you try the above patch please?

I tried the patch in my testing VM on top of 5.2.11. The VM has 8G of RAM and
these cgroup settings:
   memory.kmem.limit_in_bytes:107374182
   memory.kmem.tcp.limit_in_bytes:1073741824
   memory.limit_in_bytes:1073741824
   memory.memsw.limit_in_bytes:12884901888
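
As a quick sanity check on these values, the sketch below converts them to MiB
and confirms that the kmem limit sits below the memory limit, as step 2 of the
reproduction recipe requires. parse_settings is a made-up helper for this
example.

```python
def parse_settings(lines):
    """Turn 'name:value' lines into a dict of integer byte counts."""
    settings = {}
    for line in lines:
        name, _, value = line.strip().rpartition(":")
        settings[name] = int(value)
    return settings


vm_settings = parse_settings([
    "memory.kmem.limit_in_bytes:107374182",
    "memory.kmem.tcp.limit_in_bytes:1073741824",
    "memory.limit_in_bytes:1073741824",
    "memory.memsw.limit_in_bytes:12884901888",
])

for name, value in vm_settings.items():
    print(f"{name}: {value / 2**20:.1f} MiB")

# Precondition from the recipe: kmem limit lower than the memory limit.
kmem_below_memory = (
    vm_settings["memory.kmem.limit_in_bytes"] < vm_settings["memory.limit_in_bytes"]
)
print(kmem_below_memory)  # prints True
```

The kmem limit works out to roughly 102.4 MiB against a 1024 MiB memory limit.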

When kmem.limit_in_bytes was hit, the OOM killer killed Xorg. Here is the
full dmesg:

[    0.000000] Linux version 5.2.11+ (root@debian) (gcc version 4.7.2 (Debian 4.7.2-5)) #5 SMP Tue Sep 3 08:33:32 EDT 2019
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.2.11+ root=UUID=d51ad2bd-595d-4dad-abf3-21cddbb2aee5 ro quiet
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'standard' format.
[    0.000000] BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffddfff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffde000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000027fffffff] usable
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-1.fc28 04/01/2014
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] tsc: Detected 4000.128 MHz processor
[    0.001244] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.001245] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.001247] last_pfn = 0x280000 max_arch_pfn = 0x400000000
[    0.001266] MTRR default type: write-back
[    0.001267] MTRR fixed ranges enabled:
[    0.001268]   00000-9FFFF write-back
[    0.001268]   A0000-BFFFF uncachable
[    0.001268]   C0000-FFFFF write-protect
[    0.001269] MTRR variable ranges enabled:
[    0.001269]   0 base 00C0000000 mask FFC0000000 uncachable
[    0.001270]   1 disabled
[    0.001270]   2 disabled
[    0.001270]   3 disabled
[    0.001271]   4 disabled
[    0.001271]   5 disabled
[    0.001271]   6 disabled
[    0.001271]   7 disabled
[    0.001278] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
[    0.001282] last_pfn = 0x7ffde max_arch_pfn = 0x400000000
[    0.006102] found SMP MP-table at [mem 0x000f5d10-0x000f5d1f]
[    0.006309] Using GB pages for direct mapping
[    0.006311] BRK [0x3b201000, 0x3b201fff] PGTABLE
[    0.006312] BRK [0x3b202000, 0x3b202fff] PGTABLE
[    0.006313] BRK [0x3b203000, 0x3b203fff] PGTABLE
[    0.006326] BRK [0x3b204000, 0x3b204fff] PGTABLE
[    0.006377] RAMDISK: [mem 0x246fa000-0x2e374fff]
[    0.006385] ACPI: Early table checksum verification disabled
[    0.006417] ACPI: RSDP 0x00000000000F5B40 000014 (v00 BOCHS )
[    0.006419] ACPI: RSDT 0x000000007FFE21CF 000034 (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
[    0.006423] ACPI: FACP 0x000000007FFE1FD7 0000F4 (v03 BOCHS  BXPCFACP 00000001 BXPC 00000001)
[    0.006426] ACPI: DSDT 0x000000007FFE0040 001F97 (v01 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
[    0.006429] ACPI: FACS 0x000000007FFE0000 000040
[    0.006430] ACPI: APIC 0x000000007FFE20CB 000090 (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
[    0.006432] ACPI: HPET 0x000000007FFE215B 000038 (v01 BOCHS  BXPCHPET 00000001 BXPC 00000001)
[    0.006435] ACPI: MCFG 0x000000007FFE2193 00003C (v01 BOCHS  BXPCMCFG 00000001 BXPC 00000001)
[    0.006440] ACPI: Local APIC address 0xfee00000
[    0.006610] No NUMA configuration found
[    0.006611] Faking a node at [mem 0x0000000000000000-0x000000027fffffff]
[    0.006613] NODE_DATA(0) allocated [mem 0x27fffa000-0x27fffdfff]
[    0.006629] Zone ranges:
[    0.006630]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.006631]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.006631]   Normal   [mem 0x0000000100000000-0x000000027fffffff]
[    0.006632] Movable zone start for each node
[    0.006632] Early memory node ranges
[    0.006633]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.006633]   node   0: [mem 0x0000000000100000-0x000000007ffddfff]
[    0.006633]   node   0: [mem 0x0000000100000000-0x000000027fffffff]
[    0.006641] Zeroed struct page in unavailable ranges: 132 pages
[    0.006642] Initmem setup node 0 [mem 0x0000000000001000-0x000000027fffffff]
[    0.006643] On node 0 totalpages: 2097020
[    0.006643]   DMA zone: 64 pages used for memmap
[    0.006643]   DMA zone: 21 pages reserved
[    0.006644]   DMA zone: 3998 pages, LIFO batch:0
[    0.006677]   DMA32 zone: 8128 pages used for memmap
[    0.006678]   DMA32 zone: 520158 pages, LIFO batch:63
[    0.012564]   Normal zone: 24576 pages used for memmap
[    0.012565]   Normal zone: 1572864 pages, LIFO batch:63
[    0.025222] ACPI: PM-Timer IO Port: 0x608
[    0.025224] ACPI: Local APIC address 0xfee00000
[    0.025231] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[    0.025837] IOAPIC[0]: apic_id 0, version 32, address 0xfec00000, GSI 0-23
[    0.025840] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    0.025841] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[    0.025841] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    0.025842] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[    0.025843] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[    0.025844] ACPI: IRQ0 used by override.
[    0.025844] ACPI: IRQ5 used by override.
[    0.025845] ACPI: IRQ9 used by override.
[    0.025845] ACPI: IRQ10 used by override.
[    0.025845] ACPI: IRQ11 used by override.
[    0.025847] Using ACPI (MADT) for SMP configuration information
[    0.025848] ACPI: HPET id: 0x8086a201 base: 0xfed00000
[    0.025855] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[    0.025863] PM: Registered nosave memory: [mem 0x00000000-0x00000fff]
[    0.025864] PM: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[    0.025864] PM: Registered nosave memory: [mem 0x000a0000-0x000effff]
[    0.025864] PM: Registered nosave memory: [mem 0x000f0000-0x000fffff]
[    0.025865] PM: Registered nosave memory: [mem 0x7ffde000-0x7fffffff]
[    0.025865] PM: Registered nosave memory: [mem 0x80000000-0xafffffff]
[    0.025866] PM: Registered nosave memory: [mem 0xb0000000-0xbfffffff]
[    0.025866] PM: Registered nosave memory: [mem 0xc0000000-0xfed1bfff]
[    0.025866] PM: Registered nosave memory: [mem 0xfed1c000-0xfed1ffff]
[    0.025867] PM: Registered nosave memory: [mem 0xfed20000-0xfeffbfff]
[    0.025867] PM: Registered nosave memory: [mem 0xfeffc000-0xfeffffff]
[    0.025867] PM: Registered nosave memory: [mem 0xff000000-0xfffbffff]
[    0.025868] PM: Registered nosave memory: [mem 0xfffc0000-0xffffffff]
[    0.025869] [mem 0xc0000000-0xfed1bfff] available for PCI devices
[    0.025871] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[    0.025875] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:1
[    0.025993] percpu: Embedded 51 pages/cpu s168088 r8192 d32616 u524288
[    0.025996] pcpu-alloc: s168088 r8192 d32616 u524288 alloc=1*2097152
[    0.025997] pcpu-alloc: [0] 0 1 2 3
[    0.026008] Built 1 zonelists, mobility grouping on.  Total pages: 2064231
[    0.026008] Policy zone: Normal
[    0.026009] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.2.11+ root=UUID=d51ad2bd-595d-4dad-abf3-21cddbb2aee5 ro quiet
[    0.028738] Calgary: detecting Calgary via BIOS EBDA area
[    0.028739] Calgary: Unable to locate Rio Grande table in EBDA - bailing!
[    0.047644] Memory: 8011344K/8388080K available (8194K kernel code, 823K rwdata, 2088K rodata, 1148K init, 2048K bss, 376736K reserved, 0K cma-reserved)
[    0.047707] Kernel/User page tables isolation: enabled
[    0.047759] rcu: Hierarchical RCU implementation.
[    0.047760] rcu:     RCU restricting CPUs from NR_CPUS=512 to nr_cpu_ids=4.
[    0.047761] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.047761] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=4
[    0.047849] NR_IRQS: 33024, nr_irqs: 456, preallocated irqs: 16
[    0.048065] random: get_random_bytes called from start_kernel+0x2e9/0x4b3 with crng_init=0
[    0.060768] Console: colour VGA+ 80x25
[    0.060772] printk: console [tty0] enabled
[    0.060783] ACPI: Core revision 20190509
[    0.060925] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[    0.060966] hpet clockevent registered
[    0.060974] APIC: Switch to symmetric I/O mode setup
[    0.062688] x2apic: IRQ remapping doesn't support X2APIC mode
[    0.070234] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    0.089009] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x39a8d58b854, max_idle_ns: 440795351064 ns
[    0.089011] Calibrating delay loop (skipped), value calculated using timer frequency.. 8000.25 BogoMIPS (lpj=16000512)
[    0.089012] pid_max: default: 32768 minimum: 301
[    0.089040] LSM: Security Framework initializing
[    0.089883] Dentry cache hash table entries: 1048576 (order: 11, 8388608 bytes)
[    0.090314] Inode-cache hash table entries: 524288 (order: 10, 4194304 bytes)
[    0.090336] Mount-cache hash table entries: 16384 (order: 5, 131072 bytes)
[    0.090349] Mountpoint-cache hash table entries: 16384 (order: 5, 131072 bytes)
[    0.090531] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[    0.090565] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[    0.090565] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
[    0.090567] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.090568] Spectre V2 : Spectre mitigation: kernel not compiled with retpoline; no mitigation available!
[    0.090568] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.090773] MDS: Mitigation: Clear CPU buffers
[    0.090880] Freeing SMP alternatives memory: 16K
[    0.090954] TSC deadline timer enabled
[    0.090963] smpboot: CPU0: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz (family: 0x6, model: 0x3c, stepping: 0x3)
[    0.091020] Performance Events: Haswell events, Intel PMU driver.
[    0.091039] ... version:                2
[    0.091039] ... bit width:              48
[    0.091040] ... generic registers:      4
[    0.091040] ... value mask:             0000ffffffffffff
[    0.091041] ... max period:             000000007fffffff
[    0.091041] ... fixed-purpose events:   3
[    0.091041] ... event mask:             000000070000000f
[    0.091065] rcu: Hierarchical SRCU implementation.
[    0.091114] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    0.093010] smp: Bringing up secondary CPUs ...
[    0.093010] x86: Booting SMP configuration:
[    0.093010] .... node  #0, CPUs:      #1 #2 #3
[    0.093329] smp: Brought up 1 node, 4 CPUs
[    0.093329] smpboot: Max logical packages: 1
[    0.093329] smpboot: Total of 4 processors activated (32001.02 BogoMIPS)
[    0.093345] devtmpfs: initialized
[    0.093345] x86/mm: Memory block size: 128MB
[    0.093538] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.093538] futex hash table entries: 1024 (order: 4, 65536 bytes)
[    0.093538] NET: Registered protocol family 16
[    0.093538] audit: initializing netlink subsys (disabled)
[    0.093538] audit: type=2000 audit(1567514293.032:1): state=initialized audit_enabled=0 res=1
[    0.093538] cpuidle: using governor ladder
[    0.093538] cpuidle: using governor menu
[    0.093538] ACPI: bus type PCI registered
[    0.093538] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[    0.093538] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0xb0000000-0xbfffffff] (base 0xb0000000)
[    0.093538] PCI: MMCONFIG at [mem 0xb0000000-0xbfffffff] reserved in E820
[    0.093538] PCI: Using configuration type 1 for base access
[    0.093538] core: PMU erratum BJ122, BV98, HSD29 workaround disabled, HT off
[    0.093918] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.093918] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.245109] ACPI: Added _OSI(Module Device)
[    0.245110] ACPI: Added _OSI(Processor Device)
[    0.245111] ACPI: Added _OSI(3.0 _SCP Extensions)
[    0.245111] ACPI: Added _OSI(Processor Aggregator Device)
[    0.245113] ACPI: Added _OSI(Linux-Dell-Video)
[    0.245113] ACPI: Added _OSI(Linux-Lenovo-NV-HDMI-Audio)
[    0.245114] ACPI: Added _OSI(Linux-HPI-Hybrid-Graphics)
[    0.245894] ACPI: 1 ACPI AML tables successfully acquired and loaded
[    0.246706] ACPI: Interpreter enabled
[    0.246714] ACPI: (supports S0 S3 S4 S5)
[    0.246715] ACPI: Using IOAPIC for interrupt routing
[    0.246730] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
[    0.246782] ACPI: Enabled 1 GPEs in block 00 to 3F
[    0.248024] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-ff])
[    0.248027] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
[    0.248074] acpi PNP0A08:00: _OSC: platform does not support [LTR]
[    0.248113] acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME AER PCIeCapability]
[    0.248168] PCI host bridge to bus 0000:00
[    0.248169] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7 window]
[    0.248170] pci_bus 0000:00: root bus resource [io  0x0d00-0xffff window]
[    0.248171] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
[    0.248172] pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]
[    0.248172] pci_bus 0000:00: root bus resource [mem 0x280000000-0xa7fffffff window]
[    0.248173] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.248191] pci 0000:00:00.0: [8086:29c0] type 00 class 0x060000
[    0.248413] pci 0000:00:01.0: [1b36:0100] type 00 class 0x030000
[    0.250024] pci 0000:00:01.0: reg 0x10: [mem 0xf4000000-0xf7ffffff]
[    0.251636] pci 0000:00:01.0: reg 0x14: [mem 0xf8000000-0xfbffffff]
[    0.253062] pci 0000:00:01.0: reg 0x18: [mem 0xfc094000-0xfc095fff]
[    0.254713] pci 0000:00:01.0: reg 0x1c: [io  0xc040-0xc05f]
[    0.260010] pci 0000:00:01.0: reg 0x30: [mem 0xfc080000-0xfc08ffff pref]
[    0.260274] pci 0000:00:02.0: [8086:10d3] type 00 class 0x020000
[    0.261010] pci 0000:00:02.0: reg 0x10: [mem 0xfc040000-0xfc05ffff]
[    0.261353] pci 0000:00:02.0: reg 0x14: [mem 0xfc060000-0xfc07ffff]
[    0.262094] pci 0000:00:02.0: reg 0x18: [io  0xc060-0xc07f]
[    0.262757] pci 0000:00:02.0: reg 0x1c: [mem 0xfc090000-0xfc093fff]
[    0.265016] pci 0000:00:02.0: reg 0x30: [mem 0xfc000000-0xfc03ffff pref]
[    0.265498] pci 0000:00:1f.0: [8086:2918] type 00 class 0x060100
[    0.265732] pci 0000:00:1f.0: quirk: [io  0x0600-0x067f] claimed by ICH6 ACPI/GPIO/TCO
[    0.265841] pci 0000:00:1f.2: [8086:2922] type 00 class 0x010601
[    0.269007] pci 0000:00:1f.2: reg 0x20: [io  0xc080-0xc09f]
[    0.269470] pci 0000:00:1f.2: reg 0x24: [mem 0xfc096000-0xfc096fff]
[    0.270038] pci 0000:00:1f.3: [8086:2930] type 00 class 0x0c0500
[    0.271177] pci 0000:00:1f.3: reg 0x20: [io  0x0700-0x073f]
[    0.271942] ACPI: PCI Interrupt Link [LNKA] (IRQs 5 *10 11)
[    0.271989] ACPI: PCI Interrupt Link [LNKB] (IRQs 5 *10 11)
[    0.272031] ACPI: PCI Interrupt Link [LNKC] (IRQs 5 10 *11)
[    0.272072] ACPI: PCI Interrupt Link [LNKD] (IRQs 5 10 *11)
[    0.272114] ACPI: PCI Interrupt Link [LNKE] (IRQs 5 *10 11)
[    0.272155] ACPI: PCI Interrupt Link [LNKF] (IRQs 5 *10 11)
[    0.272260] ACPI: PCI Interrupt Link [LNKG] (IRQs 5 10 *11)
[    0.272304] ACPI: PCI Interrupt Link [LNKH] (IRQs 5 10 *11)
[    0.272320] ACPI: PCI Interrupt Link [GSIA] (IRQs *16)
[    0.272326] ACPI: PCI Interrupt Link [GSIB] (IRQs *17)
[    0.272331] ACPI: PCI Interrupt Link [GSIC] (IRQs *18)
[    0.272337] ACPI: PCI Interrupt Link [GSID] (IRQs *19)
[    0.272341] ACPI: PCI Interrupt Link [GSIE] (IRQs *20)
[    0.272346] ACPI: PCI Interrupt Link [GSIF] (IRQs *21)
[    0.272355] ACPI: PCI Interrupt Link [GSIG] (IRQs *22)
[    0.272360] ACPI: PCI Interrupt Link [GSIH] (IRQs *23)
[    0.272646] pci 0000:00:01.0: vgaarb: setting as boot VGA device
[    0.272646] pci 0000:00:01.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    0.272646] pci 0000:00:01.0: vgaarb: bridge control possible
[    0.272646] vgaarb: loaded
[    0.272646] pps_core: LinuxPPS API ver. 1 registered
[    0.272646] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    0.272646] PTP clock support registered
[    0.272646] EDAC MC: Ver: 3.0.0
[    0.272646] PCI: Using ACPI for IRQ routing
[    0.303863] PCI: pci_cache_line_size set to 64 bytes
[    0.303921] e820: reserve RAM buffer [mem 0x0009fc00-0x0009ffff]
[    0.303924] e820: reserve RAM buffer [mem 0x7ffde000-0x7fffffff]
[    0.304074] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    0.304074] hpet0: 3 comparators, 64-bit 100.000000 MHz counter
[    0.305026] clocksource: Switched to clocksource tsc-early
[    0.309208] VFS: Disk quotas dquot_6.6.0
[    0.309223] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.309262] pnp: PnP ACPI init
[    0.309303] pnp 00:00: Plug and Play ACPI device, IDs PNP0b00 (active)
[    0.309324] pnp 00:01: Plug and Play ACPI device, IDs PNP0303 (active)
[    0.309340] pnp 00:02: Plug and Play ACPI device, IDs PNP0f13 (active)
[    0.309373] pnp 00:03: Plug and Play ACPI device, IDs PNP0400 (active)
[    0.309400] pnp 00:04: Plug and Play ACPI device, IDs PNP0501 (active)
[    0.309542] pnp: PnP ACPI: found 5 devices
[    0.315938] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns
[    0.315947] pci_bus 0000:00: resource 4 [io  0x0000-0x0cf7 window]
[    0.315948] pci_bus 0000:00: resource 5 [io  0x0d00-0xffff window]
[    0.315948] pci_bus 0000:00: resource 6 [mem 0x000a0000-0x000bffff window]
[    0.315949] pci_bus 0000:00: resource 7 [mem 0xc0000000-0xfebfffff window]
[    0.315950] pci_bus 0000:00: resource 8 [mem 0x280000000-0xa7fffffff window]
[    0.315988] NET: Registered protocol family 2
[    0.316139] tcp_listen_portaddr_hash hash table entries: 4096 (order: 4, 65536 bytes)
[    0.316157] TCP established hash table entries: 65536 (order: 7, 524288 bytes)
[    0.316223] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[    0.316317] TCP: Hash tables configured (established 65536 bind 65536)
[    0.316349] UDP hash table entries: 4096 (order: 5, 131072 bytes)
[    0.316364] UDP-Lite hash table entries: 4096 (order: 5, 131072 bytes)
[    0.316408] NET: Registered protocol family 1
[    0.316445] pci 0000:00:01.0: Video device with shadowed ROM at [mem 0x000c0000-0x000dffff]
[    0.316472] PCI: CLS 0 bytes, default 64
[    0.316504] Trying to unpack rootfs image as initramfs...
[    1.807053] Freeing initrd memory: 160236K
[    1.807076] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[    1.807077] software IO TLB: mapped [mem 0x7bfde000-0x7ffde000] (64MB)
[    1.807295] RAPL PMU: API unit is 2^-32 Joules, 4 fixed counters, 10737418240 ms ovfl timer
[    1.807296] RAPL PMU: hw unit of domain pp0-core 2^-0 Joules
[    1.807296] RAPL PMU: hw unit of domain package 2^-0 Joules
[    1.807297] RAPL PMU: hw unit of domain dram 2^-0 Joules
[    1.807297] RAPL PMU: hw unit of domain pp1-gpu 2^-0 Joules
[    1.807819] workingset: timestamp_bits=40 max_order=21 bucket_order=0
[    1.807988] Key type asymmetric registered
[    1.807990] Asymmetric key parser 'x509' registered
[    1.807995] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 249)
[    1.807996] io scheduler mq-deadline registered
[    1.807996] io scheduler kyber registered
[    1.808138] intel_idle: Please enable MWAIT in BIOS SETUP
[    1.808316] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    1.830738] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[    1.831156] Linux agpgart interface v0.103
[    1.831488] i8042: PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[    1.833069] serio: i8042 KBD port at 0x60,0x64 irq 1
[    1.833073] serio: i8042 AUX port at 0x60,0x64 irq 12
[    1.833231] mousedev: PS/2 mouse device common for all mice
[    1.833306] rtc_cmos 00:00: RTC can wake from S4
[    1.833979] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
[    1.834109] rtc_cmos 00:00: registered as rtc0
[    1.834124] rtc_cmos 00:00: alarms up to one day, y3k, 114 bytes nvram, hpet irqs
[    1.834130] intel_pstate: CPU model not supported
[    1.834158] drop_monitor: Initializing network drop monitor service
[    1.834231] NET: Registered protocol family 10
[    1.834423] Segment Routing with IPv6
[    1.834435] mip6: Mobile IPv6
[    1.834436] NET: Registered protocol family 17
[    1.834445] Key type dns_resolver registered
[    1.834652] mce: Using 10 MCE banks
[    1.834666] sched_clock: Marking stable (1821634086, 12978712)->(1842031235, -7418437)
[    1.834795] registered taskstats version 1
[    1.835220] rtc_cmos 00:00: setting system clock to 2019-09-03T12:38:15 UTC (1567514295)
[    1.835531] Freeing unused kernel image memory: 1148K
[    1.849072] Write protecting the kernel read-only data: 14336k
[    1.849927] Freeing unused kernel image memory: 2028K
[    1.850647] Freeing unused kernel image memory: 2008K
[    1.850679] Run /init as init process
[    1.858108] udevd[91]: starting version 175
[    1.874277] SCSI subsystem initialized
[    1.882016] libata version 3.00 loaded.
[    1.882917] ahci 0000:00:1f.2: version 3.0
[    1.883133] PCI Interrupt Link [GSIA] enabled at IRQ 16
[    1.883483] ahci 0000:00:1f.2: AHCI 0001.0000 32 slots 6 ports 1.5 Gbps 0x3f impl SATA mode
[    1.883485] ahci 0000:00:1f.2: flags: 64bit ncq only
[    1.890825] scsi host0: ahci
[    1.891020] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
[    1.891020] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[    1.891330] PCI Interrupt Link [GSIG] enabled at IRQ 22
[    1.891870] e1000e 0000:00:02.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[    1.895622] scsi host1: ahci
[    1.896917] scsi host2: ahci
[    1.897483] scsi host3: ahci
[    1.899133] scsi host4: ahci
[    1.901667] scsi host5: ahci
[    1.901762] ata1: SATA max UDMA/133 abar m4096@0xfc096000 port 0xfc096100 irq 24
[    1.901769] ata2: SATA max UDMA/133 abar m4096@0xfc096000 port 0xfc096180 irq 24
[    1.901774] ata3: SATA max UDMA/133 abar m4096@0xfc096000 port 0xfc096200 irq 24
[    1.901778] ata4: SATA max UDMA/133 abar m4096@0xfc096000 port 0xfc096280 irq 24
[    1.901782] ata5: SATA max UDMA/133 abar m4096@0xfc096000 port 0xfc096300 irq 24
[    1.901785] ata6: SATA max UDMA/133 abar m4096@0xfc096000 port 0xfc096380 irq 24
[    1.957727] e1000e 0000:00:02.0 0000:00:02.0 (uninitialized): registered PHC clock
[    2.020118] e1000e 0000:00:02.0 eth0: (PCI Express:2.5GT/s:Width x1) 52:54:00:12:34:56
[    2.020119] e1000e 0000:00:02.0 eth0: Intel(R) PRO/1000 Network Connection
[    2.020138] e1000e 0000:00:02.0 eth0: MAC: 3, PHY: 8, PBA No: 000000-000
[    2.216698] ata6: SATA link down (SStatus 0 SControl 300)
[    2.216796] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.216943] ata5: SATA link down (SStatus 0 SControl 300)
[    2.217049] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.217184] ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[    2.217303] ata4: SATA link down (SStatus 0 SControl 300)
[    2.217340] ata1.00: ATA-7: QEMU HARDDISK, 2.5+, max UDMA/100
[    2.217341] ata1.00: 31457280 sectors, multi 16: LBA48 NCQ (depth 32)
[    2.217343] ata1.00: applying bridge limits
[    2.217392] ata3.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100
[    2.217394] ata3.00: applying bridge limits
[    2.217434] ata2.00: ATA-7: QEMU HARDDISK, 2.5+, max UDMA/100
[    2.217434] ata2.00: 104857600 sectors, multi 16: LBA48 NCQ (depth 32)
[    2.217436] ata2.00: applying bridge limits
[    2.217576] ata2.00: configured for UDMA/100
[    2.217625] ata1.00: configured for UDMA/100
[    2.217638] ata3.00: configured for UDMA/100
[    2.217817] scsi 0:0:0:0: Direct-Access     ATA      QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
[    2.218168] scsi 1:0:0:0: Direct-Access     ATA      QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
[    2.218434] scsi 2:0:0:0: CD-ROM            QEMU     QEMU DVD-ROM     2.5+ PQ: 0 ANSI: 5
[    2.221788] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB)
[    2.221793] sd 0:0:0:0: [sda] Write Protect is off
[    2.221793] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    2.221798] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.222009] sd 1:0:0:0: [sdb] 104857600 512-byte logical blocks: (53.7 GB/50.0 GiB)
[    2.222013] sd 1:0:0:0: [sdb] Write Protect is off
[    2.222014] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    2.222019] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    2.222384]  sda: sda1 sda2 < sda5 >
[    2.222422]  sdb: sdb1
[    2.222741] sd 1:0:0:0: [sdb] Attached SCSI disk
[    2.222838] sd 0:0:0:0: [sda] Attached SCSI disk
[    2.223977] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    2.224010] sd 1:0:0:0: Attached scsi generic sg1 type 0
[    2.224037] sr 2:0:0:0: Attached scsi generic sg2 type 5
[    2.241692] random: fast init done
[    2.257275] sr 2:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray
[    2.257277] cdrom: Uniform CD-ROM driver Revision: 3.20
[    2.257450] sr 2:0:0:0: Attached scsi CD-ROM sr0
[    2.484802] PM: Image not found (code -22)
[    2.500786] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[    2.809075] tsc: Refined TSC clocksource calibration: 4000.021 MHz
[    2.809088] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x39a87016a1f, max_idle_ns: 440795206381 ns
[    2.809128] clocksource: Switched to clocksource tsc
[    3.598106] udevd[437]: starting version 175
[    4.166877] i801_smbus 0000:00:1f.3: SMBus using PCI interrupt
[    4.167943] parport_pc 00:03: reported by Plug and Play ACPI
[    4.168032] parport0: PC-style at 0x378, irq 7 [PCSPP,TRISTATE]
[    4.188464] input: PC Speaker as /devices/platform/pcspkr/input/input2
[    4.189832] lpc_ich 0000:00:1f.0: I/O space for GPIO uninitialized
[    4.207278] iTCO_vendor_support: vendor-support=0
[    4.207682] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[    4.207723] iTCO_wdt: Found a ICH9 TCO device (Version=2, TCOBASE=0x0660)
[    4.207852] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[    4.241841] input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input4
[    4.241919] ACPI: Power Button [PWRF]
[    4.448579] random: crng init done
[    5.045981] input: ImExPS/2 Generic Explorer Mouse as /devices/platform/i8042/serio1/input/input3
[    5.649077] Adding 688124k swap on /dev/sda5.  Priority:-2 extents:1 across:688124k
[    5.657555] EXT4-fs (sda1): re-mounted. Opts: (null)
[    5.821773] EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro
[    5.931729] loop: module loaded
[    6.589006] RPC: Registered named UNIX socket transport module.
[    6.589007] RPC: Registered udp transport module.
[    6.589007] RPC: Registered tcp transport module.
[    6.589008] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    6.592778] FS-Cache: Loaded
[    6.604859] FS-Cache: Netfs 'nfs' registered for caching
[    6.614700] Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
[    9.281335] Bluetooth: Core ver 2.22
[    9.281351] NET: Registered protocol family 31
[    9.281352] Bluetooth: HCI device and connection manager initialized
[    9.281354] Bluetooth: HCI socket layer initialized
[    9.281355] Bluetooth: L2CAP socket layer initialized
[    9.281357] Bluetooth: SCO socket layer initialized
[    9.283691] Bluetooth: RFCOMM TTY layer initialized
[    9.283695] Bluetooth: RFCOMM socket layer initialized
[    9.283699] Bluetooth: RFCOMM ver 1.11
[    9.284992] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[    9.284993] Bluetooth: BNEP filters: protocol multicast
[    9.284995] Bluetooth: BNEP socket layer initialized
[   12.058154] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   12.058575] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   96.893773] perf: interrupt took too long (2513 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[   99.136902] stat invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
[   99.136904] CPU: 2 PID: 20281 Comm: stat Not tainted 5.2.11+ #5
[   99.136905] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.0-1.fc28 04/01/2014
[   99.136905] Call Trace:
[   99.136925]  dump_stack+0x4d/0x64
[   99.136934]  dump_header+0x54/0x2f5
[   99.136937]  ? do_raw_spin_trylock+0x1f/0x28
[   99.136938]  ? ___ratelimit+0xc3/0xe4
[   99.136939]  ? task_will_free_mem+0x25/0xa0
[   99.136940]  oom_kill_process+0x7a/0xec
[   99.136942]  out_of_memory+0x3dd/0x3f8
[   99.136943]  ? __mutex_trylock_or_owner+0x4b/0x63
[   99.136944]  pagefault_out_of_memory+0x3c/0x4b
[   99.136947]  mm_fault_error+0x66/0x150
[   99.136948]  do_user_addr_fault+0x29f/0x3a4
[   99.136954]  ? fpregs_assert_state_consistent+0x16/0x43
[   99.136955]  __do_page_fault+0x44/0x46
[   99.136955]  do_page_fault+0x9c/0xdf
[   99.136957]  ? page_fault+0x8/0x30
[   99.136958]  page_fault+0x1e/0x30
[   99.136959] RIP: 0033:0x7f6c27463f84
[   99.136960] Code: 10 4d 89 4b 18 5b 5d c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 08 48 8b 87 98 02 00 00 48 85 c0 74 5f 48 8b 40 08 <8b> 08 89 8f ec 02 00 00 8b 50 08 44 8b 40 04 8d 72 ff 85 d6 75 72
[   99.136961] RSP: 002b:00007ffc57285840 EFLAGS: 00010206
[   99.136962] RAX: 00007f6c26473280 RBX: 00000000026161b0 RCX: 00007f6c274711d7
[   99.136962] RDX: 00007f6c2667de48 RSI: 0000000000000030 RDI: 00000000026161b0
[   99.136962] RBP: 00007ffc572859b0 R08: 0000000070000029 R09: 000000006ffffdff
[   99.136964] R10: 000000006ffffeff R11: 0000000000000246 R12: 00007ffc57285a98
[   99.136964] R13: 000000006fffff48 R14: 00007ffc57285730 R15: 00007ffc572856d0
[   99.136965] Mem-Info:
[   99.136968] active_anon:24008 inactive_anon:391 isolated_anon:0
[   99.136968]  active_file:22836 inactive_file:33264 isolated_file:0
[   99.136968]  unevictable:0 dirty:13 writeback:0 unstable:0
[   99.136968]  slab_reclaimable:28549 slab_unreclaimable:4068
[   99.136968]  mapped:16364 shmem:498 pagetables:2551 bounce:0
[   99.136968]  free:1918966 free_pcp:1781 free_cma:0
[   99.136970] Node 0 active_anon:96032kB inactive_anon:1564kB active_file:91344kB inactive_file:133056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:65456kB dirty:52kB writeback:0kB shmem:1992kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[   99.136970] Node 0 DMA free:15908kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[   99.136972] lowmem_reserve[]: 0 1793 7808 7808
[   99.136973] Node 0 DMA32 free:1998916kB min:15488kB low:19360kB high:23232kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2080632kB managed:2001812kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:2896kB local_pcp:0kB free_cma:0kB
[   99.136975] lowmem_reserve[]: 0 0 6014 6014
[   99.136976] Node 0 Normal free:5661040kB min:51956kB low:64944kB high:77932kB active_anon:96032kB inactive_anon:1564kB active_file:91344kB inactive_file:133056kB unevictable:0kB writepending:52kB present:6291456kB managed:6159060kB mlocked:0kB kernel_stack:3920kB pagetables:10204kB bounce:0kB free_pcp:4220kB local_pcp:468kB free_cma:0kB
[   99.136978] lowmem_reserve[]: 0 0 0 0
[   99.136978] Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[   99.136982] Node 0 DMA32: 5*4kB (M) 6*8kB (M) 4*16kB (M) 4*32kB (M) 5*64kB (M) 6*128kB (M) 5*256kB (M) 5*512kB (M) 3*1024kB (M) 2*2048kB (M) 485*4096kB (M) = 1998916kB
[   99.136985] Node 0 Normal: 19*4kB (ME) 11*8kB (ME) 35*16kB (UME) 11*32kB (UME) 4*64kB (UE) 2*128kB (UE) 1*256kB (E) 1*512kB (E) 8*1024kB (UME) 5*2048kB (UME) 1377*4096kB (UM) = 5660980kB
[   99.136989] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[   99.136989] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[   99.136990] 56617 total pagecache pages
[   99.136990] 0 pages in swap cache
[   99.136991] Swap cache stats: add 0, delete 0, find 0/0
[   99.136991] Free swap  = 688124kB
[   99.136992] Total swap = 688124kB
[   99.136992] 2097020 pages RAM
[   99.136992] 0 pages HighMem/MovableOnly
[   99.136992] 52825 pages reserved
[   99.136993] 0 pages hwpoisoned
[   99.136993] Tasks state (memory values in pages):
[   99.136993] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[   99.136995] [    437]     0   437     5428      690    90112        0         -1000 udevd
[   99.136996] [    565]     0   565     5427      614    86016        0         -1000 udevd
[   99.136997] [    567]     0   567     5427      615    86016        0         -1000 udevd
[   99.136998] [   1800]     0  1800     4747      513    86016        0             0 rpcbind
[   99.136999] [   1833]   105  1833     5840      617    90112        0             0 rpc.statd
[   99.137000] [   1849]     0  1849     6328       55    94208        0             0 rpc.idmapd
[   99.137000] [   2193]     0  2193    13198      690   110592        0             0 rsyslogd
[   99.137001] [   2260]     7  2260     3147      357    65536        0             0 lpd
[   99.137002] [   2320]     0  2320     4172       38    73728        0             0 atd
[   99.137003] [   2566]     0  2566     5106      538    86016        0             0 cron
[   99.137005] [   2584]     0  2584     1033      395    57344        0             0 acpid
[   99.137005] [   2611]   104  2611    12736      758   143360        0             0 exim4
[   99.137006] [   2632]   101  2632     7567      659   106496        0             0 dbus-daemon
[   99.137007] [   2682]     0  2682    35396     1326   167936        0             0 lightdm
[   99.137008] [   2700]     0  2700     5255      680    81920        0             0 bluetoothd
[   99.137009] [   2752]     0  2752    41044     1935   221184        0             0 NetworkManager
[   99.137009] [   2782]     0  2782    47228     9503   368640        0             0 Xorg
[   99.137011] [   2787]     0  2787    33036     1542   159744        0             0 polkitd
[   99.137024] [   2814]     0  2814     4068      464    77824        0             0 getty
[   99.137025] [   2815]     0  2815     4068      462    77824        0             0 getty
[   99.137026] [   2816]     0  2816     4068      488    73728        0             0 getty
[   99.137027] [   2817]     0  2817     4068      486    77824        0             0 getty
[   99.137027] [   2818]     0  2818     4068      485    77824        0             0 getty
[   99.137028] [   2819]     0  2819     4068      489    69632        0             0 getty
[   99.137029] [   2823]     0  2823    20217     1297   200704        0             0 modem-manager
[   99.137030] [   2842]     0  2842    31895     1439   151552        0             0 console-kit-dae
[   99.137030] [   2916]     0  2916     2494     1229    69632        0             0 dhclient
[   99.137031] [   2922]     0  2922    39491     1693   192512        0             0 upowerd
[   99.137032] [   3016]     0  3016    39659     1450   229376        0             0 lightdm
[   99.137032] [   3096]     0  3096    17323      732   159744        0             0 gnome-keyring-d
[   99.137033] [   3107]     0  3107     1049      358    53248        0             0 sh
[   99.137034] [   3123]     0  3123     3122       83    65536        0             0 ssh-agent
[   99.137035] [   3126]     0  3126     6051      504    90112        0             0 dbus-launch
[   99.137035] [   3127]     0  3127     7554      609   102400        0             0 dbus-daemon
[   99.137036] [   3135]     0  3135    12370     1172   143360        0             0 xfconfd
[   99.137037] [   3140]     0  3140    38776     2956   323584        0             0 xfce4-session
[   99.137038] [   3145]     0  3145    37828     3832   344064        0             0 xfwm4
[   99.137038] [   3147]     0  3147    31422     2231   286720        0             0 xfsettingsd
[   99.137039] [   3148]     0  3148    39021     3116   360448        0             0 Thunar
[   99.137040] [   3150]     0  3150    15479     1242   159744        0             0 gvfsd
[   99.137040] [   3153]     0  3153    72257     5559   466944        0             0 xfce4-panel
[   99.137041] [   3154]     0  3154    91757     6682   487424        0             0 xfdesktop
[   99.137042] [   3157]     0  3157    38106     1856   323584        0             0 xfce4-settings-
[   99.137043] [   3158]     0  3158    53548     2101   315392        0             0 xfce4-power-man
[   99.137044] [   3165]     0  3165    46647     2774   266240        0             0 polkit-gnome-au
[   99.137044] [   3167]     0  3167    17743     1675   180224        0             0 gvfs-gdu-volume
[   99.137045] [   3170]     0  3170    30373     1444   147456        0             0 udisks-daemon
[   99.137046] [   3172]     0  3172   120880     5112   434176        0             0 nm-applet
[   99.137046] [   3173]     0  3173    11857      705   131072        0             0 udisks-daemon
[   99.137047] [   3176]     0  3176    15095     1259   167936        0             0 gvfs-gphoto2-vo
[   99.137048] [   3179]     0  3179    35081     1389   303104        0             0 xfce4-power-man
[   99.137050] [   3181]     0  3181    58995     8092   507904        0             0 system-config-p
[   99.137051] [   3182]     0  3182    57804     3624   356352        0             0 xfce4-volumed
[   99.137051] [   3184]     0  3184    64742     5974   471040        0             0 xfce4-terminal
[   99.137052] [   3187]     0  3187    34538     2410   311296        0             0 xfce4-notifyd
[   99.137053] [   3189]     0  3189    13290     1180   143360        0             0 gconfd-2
[   99.137054] [   3192]     0  3192    19734      864   176128        0             0 gvfs-afc-volume
[   99.137054] [   3195]     0  3195    16606     1567   176128        0             0 gvfsd-trash
[   99.137055] [   3199]     0  3199    36763     3111   331776        0             0 panel-6-systray
[   99.137056] [   3201]     0  3201     3642      419    73728        0             0 gnome-pty-helpe
[   99.137056] [   3202]     0  3202     4869      880    81920        0             0 bash
[   99.137057] [   3206]     0  3206     4869      897    77824        0             0 bash
[   99.137058] [   3207]     0  3207     3655      668    73728        0             0 watch
[   99.137058] [   3223]     0  3223     3183      615    69632        0             0 find
[   99.137059] [   3224]     0  3224     1473      437    57344        0             0 xargs
[   99.137060] [  20281]     0 20281     4602      561    77824        0             0 stat
[   99.137060] [  20282]     0 20282     4078      583    69632        0             0 stat
[   99.137061] [  20283]     0 20283     3019      526    65536        0             0 stat
[   99.137062] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=Xorg,pid=2782,uid=0
[   99.137077] Out of memory: Killed process 2782 (Xorg) total-vm:188912kB, anon-rss:24884kB, file-rss:11664kB, shmem-rss:1464kB
[   99.137873] oom_reaper: reaped process 2782 (Xorg), now anon-rss:0kB, file-rss:0kB, shmem-rss:1468kB
[  192.530507] perf: interrupt took too long (3155 > 3141), lowering kernel.perf_event_max_sample_rate to 63250

All I'm doing is running
"find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
inside the memory cgroup. find, xargs and stat use only a tiny amount of RAM
themselves, so most of the memory charged to the cgroup is ext4 inode cache.
That should never trigger the OOM killer, inside or outside the cgroup.
Old cache entries should be evicted instead.
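
For anyone who wants to watch this happen, here is a minimal sketch of a
helper that reports how much kmem headroom a v1 memory cgroup has left. It
assumes the standard cgroup v1 files memory.kmem.usage_in_bytes and
memory.kmem.limit_in_bytes are present in the cgroup directory; the function
name kmem_headroom is made up for illustration.

```shell
# Print the remaining kernel-memory headroom (in bytes) of a v1 memory cgroup.
# $1: cgroup directory, e.g. /sys/fs/cgroup/memory/12G
# kmem_headroom is a hypothetical helper name, not an existing tool.
kmem_headroom() {
    dir=$1
    # Both files hold a single decimal byte count.
    usage=$(cat "$dir/memory.kmem.usage_in_bytes") || return 1
    limit=$(cat "$dir/memory.kmem.limit_in_bytes") || return 1
    echo $((limit - usage))
}
```

Running something like `watch -n 1 'cat /sys/fs/cgroup/memory/12G/memory.kmem.usage_in_bytes'`
next to the find/stat job, or calling kmem_headroom in a loop, should show
the headroom shrinking toward zero shortly before the OOM killer fires.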



* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-03  7:41     ` Michal Hocko
  2019-09-03 12:01       ` Thomas Lindroth
@ 2019-09-03 12:05       ` Andrey Ryabinin
  2019-09-03 12:22         ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Andrey Ryabinin @ 2019-09-03 12:05 UTC (permalink / raw)
  To: Michal Hocko, Thomas Lindroth; +Cc: linux-mm, stable



On 9/3/19 10:41 AM, Michal Hocko wrote:
> On Mon 02-09-19 21:34:29, Thomas Lindroth wrote:
>> On 9/2/19 9:16 AM, Michal Hocko wrote:
>>> On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
>>>> After upgrading to the 4.19 series I've started getting problems with
>>>> early OOM.
>>>
>>> What is the kernel you have updated from? Would it be possible to try
>>> the current Linus' tree?
>>
>> I did some more testing and it turns out this is not a regression after all.
>>
>> I followed up on my hunch and monitored memory.kmem.max_usage_in_bytes while
>> running cgexec -g memory:12G bash -c 'find / -xdev -type f -print0 | \
>>         xargs -0 -n 1 -P 8 stat > /dev/null'
>>
>> Just as memory.kmem.max_usage_in_bytes = memory.kmem.limit_in_bytes the OOM
>> killer kicked in and killed my X server.
>>
>> Using the find|stat approach it was easy to test the problem in a testing VM.
>> I was able to reproduce the problem in all these kernels:
>>   4.9.0
>>   4.14.0
>>   4.14.115
>>   4.19.0
>>   5.2.11
>>
>> 5.3-rc6 didn't build in the VM. The build environment is too old probably.
>>
>> I was curious why I initially couldn't reproduce the problem in 4.14 by
>> building chromium. I was again able to successfully build chromium using
>> 4.14.115. Turns out memory.kmem.max_usage_in_bytes was 1015689216 after
>> building and my limit is set to 1073741824. I guess some unrelated change in
>> memory management raised that slightly for 4.19 triggering the problem.
>>
>> If you want to reproduce for yourself here are the steps:
>> 1. build any kernel above 4.9 using something like my .config
>> 2. setup a v1 memory cgroup with memory.kmem.limit_in_bytes lower than
>>    memory.limit_in_bytes. I used 100M in my testing VM.
>> 3. Run "find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
>>    in the cgroup.
>> 4. Assuming there is enough inodes on the rootfs the global OOM killer
>>    should kick in when memory.kmem.max_usage_in_bytes =
>>    memory.kmem.limit_in_bytes and kill something outside the cgroup.
> 
> This is certainly a bug. Is this still an OOM triggered from
> pagefault_out_of_memory? Since 4.19 (29ef680ae7c21) the memcg charge
> path should invoke the memcg oom killer directly from the charge path.
> If that doesn't happen then the failing charge is either GFP_NOFS or a
> large allocation.
> 
> The former has been fixed just recently by http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
> and I suspect this is a fix you are looking for. Although it is curious
> that you can see a global oom even before because the charge path would
> mark an oom situation even for NOFS context and it should trigger the
> memcg oom killer on the way out from the page fault path. So essentially
> the same call trace except the oom killer should be constrained to the
> memcg context.
> 
> Could you try the above patch please?
> 

It won't help. We are hitting the ->kmem limit here, not ->memory or ->memsw, so try_charge() succeeds and
only __memcg_kmem_charge_memcg() fails to charge ->kmem and returns -ENOMEM.

Limiting kmem has never worked and it doesn't work now. AFAIK this feature was never finished because
no clear purpose/use case was found. I remember there was some discussion about this at LSF/MM (https://lwn.net/Articles/636331/),
but I don't remember the discussion itself.
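
To make the ordering concrete, here is a toy model in shell. It is not kernel
code: the variables and the charge_kmem function are invented for
illustration, the limits are the ones from the original report, and the
rollback of the first charge is my reading of the code rather than something
verified in this thread.

```shell
# Toy model of a v1 kmem charge; shell variables stand in for page counters.
mem_usage=0
mem_limit=$((12 * 1024 * 1024 * 1024))   # memory.limit_in_bytes from the report
kmem_usage=0
kmem_limit=$((1024 * 1024 * 1024))       # memory.kmem.limit_in_bytes from the report

# charge_kmem BYTES
# Returns 0 on success, 1 if the ordinary limit is hit (reclaim and the memcg
# OOM killer are not modelled), 12 (the value of ENOMEM) if only the kmem
# limit is hit.
charge_kmem() {
    n=$1
    # Step 1: charge the ordinary counter, as try_charge() would.
    [ $((mem_usage + n)) -le "$mem_limit" ] || return 1
    mem_usage=$((mem_usage + n))
    # Step 2: charge the kmem counter directly; no reclaim or OOM path here.
    if [ $((kmem_usage + n)) -gt "$kmem_limit" ]; then
        mem_usage=$((mem_usage - n))     # assumption: the first charge is cancelled
        return 12
    fi
    kmem_usage=$((kmem_usage + n))
}
```

In this model a charge that fits under the 12G limit but not under the 1G
kmem limit fails with 12 even though nothing was reclaimed and no memcg OOM
handling ran, which is the situation described above.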



* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-03 12:05       ` Andrey Ryabinin
@ 2019-09-03 12:22         ` Michal Hocko
  2019-09-03 18:20           ` Thomas Lindroth
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-09-03 12:22 UTC (permalink / raw)
  To: Andrey Ryabinin; +Cc: Thomas Lindroth, linux-mm, stable

On Tue 03-09-19 15:05:22, Andrey Ryabinin wrote:
> 
> 
> On 9/3/19 10:41 AM, Michal Hocko wrote:
> > On Mon 02-09-19 21:34:29, Thomas Lindroth wrote:
> >> On 9/2/19 9:16 AM, Michal Hocko wrote:
> >>> On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
> >>>> After upgrading to the 4.19 series I've started getting problems with
> >>>> early OOM.
> >>>
> >>> What is the kernel you have updated from? Would it be possible to try
> >>> the current Linus' tree?
> >>
> >> I did some more testing and it turns out this is not a regression after all.
> >>
> >> I followed up on my hunch and monitored memory.kmem.max_usage_in_bytes while
> >> running cgexec -g memory:12G bash -c 'find / -xdev -type f -print0 | \
> >>         xargs -0 -n 1 -P 8 stat > /dev/null'
> >>
> >> Just as memory.kmem.max_usage_in_bytes = memory.kmem.limit_in_bytes the OOM
> >> killer kicked in and killed my X server.
> >>
> >> Using the find|stat approach it was easy to test the problem in a testing VM.
> >> I was able to reproduce the problem in all these kernels:
> >>   4.9.0
> >>   4.14.0
> >>   4.14.115
> >>   4.19.0
> >>   5.2.11
> >>
> >> 5.3-rc6 didn't build in the VM. The build environment is too old probably.
> >>
> >> I was curious why I initially couldn't reproduce the problem in 4.14 by
> >> building chromium. I was again able to successfully build chromium using
> >> 4.14.115. Turns out memory.kmem.max_usage_in_bytes was 1015689216 after
> >> building and my limit is set to 1073741824. I guess some unrelated change in
> >> memory management raised that slightly for 4.19 triggering the problem.
> >>
> >> If you want to reproduce for yourself here are the steps:
> >> 1. build any kernel above 4.9 using something like my .config
> >> 2. setup a v1 memory cgroup with memory.kmem.limit_in_bytes lower than
> >>    memory.limit_in_bytes. I used 100M in my testing VM.
> >> 3. Run "find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
> >>    in the cgroup.
> >> 4. Assuming there is enough inodes on the rootfs the global OOM killer
> >>    should kick in when memory.kmem.max_usage_in_bytes =
> >>    memory.kmem.limit_in_bytes and kill something outside the cgroup.
> > 
> > This is certainly a bug. Is this still an OOM triggered from
> > pagefault_out_of_memory? Since 4.19 (29ef680ae7c21) the memcg charge
> > path should invoke the memcg oom killer directly from the charge path.
> > If that doesn't happen then the failing charge is either GFP_NOFS or a
> > large allocation.
> > 
> > The former has been fixed just recently by http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
> > and I suspect this is a fix you are looking for. Although it is curious
> > that you can see a global oom even before because the charge path would
> > mark an oom situation even for NOFS context and it should trigger the
> > memcg oom killer on the way out from the page fault path. So essentially
> > the same call trace except the oom killer should be constrained to the
> > memcg context.
> > 
> > Could you try the above patch please?
> > 
> 
> It won't help. We are hitting the ->kmem limit here, not ->memory or ->memsw, so try_charge() is successful and
> only __memcg_kmem_charge_memcg() fails to charge ->kmem and returns -ENOMEM.
> 
> Limiting kmem just never worked and it doesn't work now. AFAIK this feature hasn't been finished because 
> there was no clear purpose/use case found. I remember that there was some discussion on lsfmm about this https://lwn.net/Articles/636331/
> but I don't remember the discussion itself.

Ohh, right you are. I completely forgot that __memcg_kmem_charge_memcg
doesn't really trigger the normal charge path but rather charges the
counter directly.

So you are right. The v1 kmem accounting is broken and probably
unfixable. Do not use it.

Thanks!

-- 
Michal Hocko
SUSE Labs



* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-03 12:22         ` Michal Hocko
@ 2019-09-03 18:20           ` Thomas Lindroth
  2019-09-03 19:36             ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Lindroth @ 2019-09-03 18:20 UTC (permalink / raw)
  To: Michal Hocko, Andrey Ryabinin; +Cc: linux-mm, stable

On 9/3/19 2:22 PM, Michal Hocko wrote:
> On Tue 03-09-19 15:05:22, Andrey Ryabinin wrote:
>>
>>
>> On 9/3/19 10:41 AM, Michal Hocko wrote:
>>> On Mon 02-09-19 21:34:29, Thomas Lindroth wrote:
>>>> On 9/2/19 9:16 AM, Michal Hocko wrote:
>>>>> On Sun 01-09-19 22:43:05, Thomas Lindroth wrote:
>>>>>> After upgrading to the 4.19 series I've started getting problems with
>>>>>> early OOM.
>>>>>
>>>>> What is the kernel you have updated from? Would it be possible to try
>>>>> the current Linus' tree?
>>>>
>>>> I did some more testing and it turns out this is not a regression after all.
>>>>
>>>> I followed up on my hunch and monitored memory.kmem.max_usage_in_bytes while
>>>> running cgexec -g memory:12G bash -c 'find / -xdev -type f -print0 | \
>>>>          xargs -0 -n 1 -P 8 stat > /dev/null'
>>>>
>>>> Just as memory.kmem.max_usage_in_bytes = memory.kmem.limit_in_bytes the OOM
>>>> killer kicked in and killed my X server.
>>>>
>>>> Using the find|stat approach it was easy to test the problem in a testing VM.
>>>> I was able to reproduce the problem in all these kernels:
>>>>    4.9.0
>>>>    4.14.0
>>>>    4.14.115
>>>>    4.19.0
>>>>    5.2.11
>>>>
>>>> 5.3-rc6 didn't build in the VM. The build environment is too old probably.
>>>>
>>>> I was curious why I initially couldn't reproduce the problem in 4.14 by
>>>> building chromium. I was again able to successfully build chromium using
>>>> 4.14.115. Turns out memory.kmem.max_usage_in_bytes was 1015689216 after
>>>> building and my limit is set to 1073741824. I guess some unrelated change in
>>>> memory management raised that slightly for 4.19 triggering the problem.
>>>>
>>>> If you want to reproduce for yourself here are the steps:
>>>> 1. build any kernel above 4.9 using something like my .config
>>>> 2. setup a v1 memory cgroup with memory.kmem.limit_in_bytes lower than
>>>>     memory.limit_in_bytes. I used 100M in my testing VM.
>>>> 3. Run "find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
>>>>     in the cgroup.
>>>> 4. Assuming there is enough inodes on the rootfs the global OOM killer
>>>>     should kick in when memory.kmem.max_usage_in_bytes =
>>>>     memory.kmem.limit_in_bytes and kill something outside the cgroup.
>>>
>>> This is certainly a bug. Is this still an OOM triggered from
>>> pagefault_out_of_memory? Since 4.19 (29ef680ae7c21) the memcg charge
>>> path should invoke the memcg oom killer directly from the charge path.
>>> If that doesn't happen then the failing charge is either GFP_NOFS or a
>>> large allocation.
>>>
>>> The former has been fixed just recently by http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
>>> and I suspect this is a fix you are looking for. Although it is curious
>>> that you can see a global oom even before because the charge path would
>>> mark an oom situation even for NOFS context and it should trigger the
>>> memcg oom killer on the way out from the page fault path. So essentially
>>> the same call trace except the oom killer should be constrained to the
>>> memcg context.
>>>
>>> Could you try the above patch please?
>>>
>>
>> It won't help. We are hitting the ->kmem limit here, not ->memory or ->memsw, so try_charge() is successful and
>> only __memcg_kmem_charge_memcg() fails to charge ->kmem and returns -ENOMEM.
>>
>> Limiting kmem just never worked and it doesn't work now. AFAIK this feature hasn't been finished because
>> there was no clear purpose/use case found. I remember that there was some discussion on lsfmm about this https://lwn.net/Articles/636331/
>> but I don't remember the discussion itself.
> 
> Ohh, right you are. I completely forgot that __memcg_kmem_charge_memcg
> doesn't really trigger the normal charge path but rather charges the
> counter directly.
> 
> So you are right. The v1 kmem accounting is broken and probably
> unfixable. Do not use it.

I don't know why I set up a kmem limit. I think the documentation I followed
when setting up the cgroup said that kmem is counted separately from the
regular memory limit, so if you want to limit total memory you have to limit
both. That's what I did.

If kmem accounting is broken, unfixable, and causes kernel crashes when
used, why not remove it? Or perhaps disable it by default, as with
cgroup.memory=nokmem, or at least print a warning to dmesg if the user tries
to use it in a way that causes crashes?



* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
       [not found] ` <666dbcde-1b8a-9e2d-7d1f-48a117c78ae1@I-love.SAKURA.ne.jp>
@ 2019-09-03 18:25   ` Thomas Lindroth
       [not found]     ` <4d0eda9a-319d-1a7d-1eed-71da90902367@i-love.sakura.ne.jp>
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Lindroth @ 2019-09-03 18:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm

On 9/3/19 3:33 PM, Tetsuo Handa wrote:
> On 2019/09/02 5:43, Thomas Lindroth wrote:
>> Those kernel memory allocation failures can also cause kernel NULL pointer
>> dereference. Here is a dmesg captured over netconsole when that happens:
> 
> Can you establish steps to reproduce this crash?
> Since it seems that __GFP_NOFAIL allocation is failing for some reason, we should fix it.

I have no reliable way to reproduce the crash. I just set up a v1 memory cgroup
with memory.kmem.limit_in_bytes < memory.limit_in_bytes, then run something that
allocates SLUB memory and depletes the kmem limit. Usually the OOM killer is
triggered when the kmem limit is hit, but sometimes I get warnings like
"SLUB: Unable to allocate memory on node -1" and a kernel NULL pointer
dereference.

Running "find / -xdev -type f -print0 | xargs -0 -n 1 -P 8 stat > /dev/null"
in the cgroup is an easy way to allocate ext4_inode_cache objects and deplete
the kmem limit, but I never got a NULL pointer dereference that way. Building
the chromium browser in the cgroup can also hit the kmem limit and will
sometimes cause a NULL pointer dereference.

Here is another null pointer deref I got while building chromium in the cgroup.
4,1180,556857645,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1181,556857652,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1182,556857654,-;  node 0: slabs: 17997, objs: 557851, free: 0
4,1183,556857675,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1184,556857677,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1185,556857679,-;  node 0: slabs: 17997, objs: 557851, free: 0
4,1186,556857955,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1187,556857957,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1188,556857959,-;  node 0: slabs: 18003, objs: 557869, free: 0
4,1189,556857974,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1190,556857976,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1191,556857979,-;  node 0: slabs: 18003, objs: 557869, free: 0
4,1192,556857989,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1193,556857992,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1194,556857994,-;  node 0: slabs: 18003, objs: 557869, free: 0
4,1195,556858518,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1196,556858522,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1197,556858523,-;  node 0: slabs: 18003, objs: 557869, free: 0
4,1198,556858535,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1199,556858537,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1200,556858538,-;  node 0: slabs: 18003, objs: 557869, free: 0
4,1201,556858545,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1202,556858547,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1203,556858548,-;  node 0: slabs: 18003, objs: 557869, free: 0
4,1204,556858554,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1205,556858556,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1206,556858558,-;  node 0: slabs: 18003, objs: 557869, free: 0
4,1207,556858748,-;SLUB: Unable to allocate memory on node -1, gfp=0x600040(GFP_NOFS)
4,1208,556858751,-;  cache: ext4_inode_cache(100:12G), object size: 1024, buffer size: 1032, default order: 3, min order: 0
4,1209,556858753,-;  node 0: slabs: 18003, objs: 557869, free: 0
1,1210,556861832,-;BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
6,1211,556861836,-;PGD 0
4,1212,556861837,c;P4D 0
4,1213,556861839,-;Oops: 0000 [#1] PREEMPT SMP PTI
4,1214,556861841,-;CPU: 7 PID: 12228 Comm: find Not tainted 4.19.69 #43
4,1215,556861842,-;Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
4,1216,556861846,-;RIP: 0010:__getblk_gfp+0x181/0x240
4,1217,556861848,-;Code: e8 e4 ee ff ff 48 89 04 24 49 8b 46 30 48 8d b8 80 00 00 00 e8 20 5e 67 00 48 8b 04 24 44 8b 4c 24 1c 48 89 c1 eb 03 48 89 d1 <48> 8b 51 08 48 85 d2 75 f4 48 89 41 08 49 8b 4f 08 48 8d 51 ff 83
4,1218,556861850,-;RSP: 0018:ffffaba441853be8 EFLAGS: 00010246
4,1219,556861851,-;RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
4,1220,556861853,-;RDX: 0000000000000001 RSI: 0000000000000082 RDI: ffff9824dd8943c8
4,1221,556861854,-;RBP: 0000000000000000 R08: ffffd552cd660e48 R09: 0000000000000000
4,1222,556861855,-;R10: 0000000000000000 R11: 0000000000000036 R12: ffff9824dd894100
4,1223,556861856,-;R13: 0000000001301775 R14: ffff9824dd8941d8 R15: ffffd552c84f1380
4,1224,556861858,-;FS:  00007fdd32a0cb80(0000) GS:ffff9824df9c0000(0000) knlGS:0000000000000000
4,1225,556861859,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1226,556861861,-;CR2: 0000000000000008 CR3: 00000003614b6002 CR4: 00000000001606e0
4,1227,556861862,-;Call Trace:
4,1228,556861866,-; ext4_getblk+0x91/0x1a0
4,1229,556861868,-; ext4_bread+0x1e/0xa0
4,1230,556861871,-; ? tomoyo_path_perm+0xa3/0x200
4,1231,556861873,-; __ext4_read_dirblock+0x2c/0x2e0
4,1232,556861875,-; htree_dirblock_to_tree+0x6a/0x1e0
4,1233,556861877,-; ext4_htree_fill_tree+0xcd/0x2f0
4,1234,556861880,-; ? kmem_cache_alloc_trace+0x163/0x1c0
4,1235,556861882,-; ext4_readdir+0x472/0x870
4,1236,556861886,-; iterate_dir+0x138/0x180
4,1237,556861967,-; ksys_getdents64+0x9c/0x130
4,1238,556861969,-; ? iterate_dir+0x180/0x180
4,1239,556861972,-; __x64_sys_getdents64+0x16/0x20
4,1240,556861974,-; do_syscall_64+0x59/0x180
4,1241,556861977,-; entry_SYSCALL_64_after_hwframe+0x44/0xa9
4,1242,556861979,-;RIP: 0033:0x7fdd32adef3b
4,1243,556861981,-;Code: 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 83 ec 18 64 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 1d 48 8b 4c 24 08 64 48 33 0c 25 28 00 00 00
4,1244,556861982,-;RSP: 002b:00007ffdf210cc10 EFLAGS: 00000246
4,1245,556861984,c; ORIG_RAX: 00000000000000d9
4,1246,556861985,-;RAX: ffffffffffffffda RBX: 0000563985f7f110 RCX: 00007fdd32adef3b
4,1247,556861986,-;RDX: 0000000000008000 RSI: 0000563985f7f140 RDI: 0000000000000006
4,1248,556861987,-;RBP: 0000563985f7f140 R08: 0000563985f740a8 R09: 0000563985f768f0
4,1249,556861988,-;R10: 0000000000000100 R11: 0000000000000246 R12: ffffffffffffff80
4,1250,556861990,-;R13: 0000000000000000 R14: 0000563985f73c00 R15: 0000563985f74040
4,1251,556861991,-;Modules linked in:
4,1252,556861993,c; 8021q
4,1253,556861994,c; iptable_mangle
4,1254,556861996,c; xt_limit
4,1255,556861997,c; xt_conntrack
4,1256,556861998,c; iptable_filter
4,1257,556862000,c; iptable_nat
4,1258,556862001,c; nf_nat_ipv4
4,1259,556862002,c; nf_nat
4,1260,556862101,c; ip_tables
4,1261,556862102,c; arc4
4,1262,556862103,c; ath9k_htc
4,1263,556862104,c; ath9k_common
4,1264,556862105,c; ath9k_hw
4,1265,556862107,c; ath
4,1266,556862108,c; mac80211
4,1267,556862109,c; kvm_intel
4,1268,556862110,c; cfg80211
4,1269,556862111,c; kvm
4,1270,556862112,c; crc32_pclmul
4,1271,556862113,c; uas
4,1272,556862115,c; usb_storage
4,1273,556862116,c; cdc_acm
4,1274,556862117,c; joydev
4,1275,556862118,-;CR2: 0000000000000008
4,1276,556862120,-;---[ end trace b7a234b0d1e0ec38 ]---
4,1277,556862122,-;RIP: 0010:__getblk_gfp+0x181/0x240
4,1278,556862123,-;Code: e8 e4 ee ff ff 48 89 04 24 49 8b 46 30 48 8d b8 80 00 00 00 e8 20 5e 67 00 48 8b 04 24 44 8b 4c 24 1c 48 89 c1 eb 03 48 89 d1 <48> 8b 51 08 48 85 d2 75 f4 48 89 41 08 49 8b 4f 08 48 8d 51 ff 83
4,1279,556862125,-;RSP: 0018:ffffaba441853be8 EFLAGS: 00010246
4,1280,556862126,-;RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
4,1281,556862127,-;RDX: 0000000000000001 RSI: 0000000000000082 RDI: ffff9824dd8943c8
4,1282,556862129,-;RBP: 0000000000000000 R08: ffffd552cd660e48 R09: 0000000000000000
4,1283,556862130,-;R10: 0000000000000000 R11: 0000000000000036 R12: ffff9824dd894100
4,1284,556862131,-;R13: 0000000001301775 R14: ffff9824dd8941d8 R15: ffffd552c84f1380
4,1285,556862132,-;FS:  00007fdd32a0cb80(0000) GS:ffff9824df9c0000(0000) knlGS:0000000000000000
4,1286,556862134,-;CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
4,1287,556862176,-;CR2: 0000000000000008 CR3: 00000003614b6002 CR4: 00000000001606e0
0,1288,556862178,-;Kernel panic - not syncing: Fatal exception
0,1289,556862184,-;Kernel Offset: 0x30000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
0,1290,556862186,-;---[ end Kernel panic - not syncing: Fatal exception ]---



* Re: [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69
  2019-09-03 18:20           ` Thomas Lindroth
@ 2019-09-03 19:36             ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2019-09-03 19:36 UTC (permalink / raw)
  To: Thomas Lindroth; +Cc: Andrey Ryabinin, linux-mm, stable

On Tue 03-09-19 20:20:20, Thomas Lindroth wrote:
[...]
> If kmem accounting is broken, unfixable, and causes kernel crashes when
> used, why not remove it? Or perhaps disable it by default, as with
> cgroup.memory=nokmem, or at least print a warning to dmesg if the user tries
> to use it in a way that causes crashes?

Well, the cgroup v1 interfaces and implementation are mostly frozen, and users
are advised to use the v2 interface, which doesn't suffer from this problem
because there is no separate kmem limit and both user and kernel charges
are tied to the same counter.

We can be more explicit about shortcomings in the documentation but in
general v1 is deprecated.

-- 
Michal Hocko
SUSE Labs



* Re: [BUG] kmemcg limit defeats __GFP_NOFAIL allocation
       [not found]     ` <4d0eda9a-319d-1a7d-1eed-71da90902367@i-love.sakura.ne.jp>
@ 2019-09-04 11:25       ` Michal Hocko
       [not found]         ` <4d87d770-c110-224f-6c0c-d6fada90417d@i-love.sakura.ne.jp>
       [not found]         ` <0056063b-46ff-0ebd-ff0d-c96a1f9ae6b1@i-love.sakura.ne.jp>
  0 siblings, 2 replies; 26+ messages in thread
From: Michal Hocko @ 2019-09-04 11:25 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrey Ryabinin, Thomas Lindroth, linux-mm

On Wed 04-09-19 18:36:06, Tetsuo Handa wrote:
[...]
> The first bug is that __memcg_kmem_charge_memcg() in mm/memcontrol.c is
> failing to return 0 when it is a __GFP_NOFAIL allocation request.
> We should ignore limits when it is a __GFP_NOFAIL allocation request.

OK, fixing that sounds like a reasonable thing to do.
 
> If we force __memcg_kmem_charge_memcg() to return 0, then
> 
> ----------
>         struct page_counter *counter;
>         int ret;
> 
> +       if (gfp & __GFP_NOFAIL)
> +               return 0;
> +
>         ret = try_charge(memcg, gfp, nr_pages);
>         if (ret)
>                 return ret;
> ----------

This should be more likely something like

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ec5e12486a7..05a4828edf9d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2820,7 +2820,8 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 		return ret;
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
-	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
+	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter) &&
+	    !(gfp_mask & __GFP_NOFAIL)) {
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}

> the second bug that alloc_slabmgmt() in mm/slab.c is returning NULL
> when it is a __GFP_NOFAIL allocation request will appear.
> I don't know how to handle this.

I am sorry, I do not follow, why would alloc_slabmgmt return NULL
with forcing gfp_nofail charges?
-- 
Michal Hocko
SUSE Labs



* Re: [BUG] kmemcg limit defeats __GFP_NOFAIL allocation
       [not found]         ` <4d87d770-c110-224f-6c0c-d6fada90417d@i-love.sakura.ne.jp>
@ 2019-09-04 11:59           ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2019-09-04 11:59 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrey Ryabinin, Thomas Lindroth, linux-mm

On Wed 04-09-19 20:32:25, Tetsuo Handa wrote:
> On 2019/09/04 20:25, Michal Hocko wrote:
> > On Wed 04-09-19 18:36:06, Tetsuo Handa wrote:
> > [...]
> >> The first bug is that __memcg_kmem_charge_memcg() in mm/memcontrol.c is
> >> failing to return 0 when it is a __GFP_NOFAIL allocation request.
> >> We should ignore limits when it is a __GFP_NOFAIL allocation request.
> > 
> > OK, fixing that sounds like a reasonable thing to do.
> >  
> >> If we force __memcg_kmem_charge_memcg() to return 0, then
> >>
> >> ----------
> >>         struct page_counter *counter;
> >>         int ret;
> >>
> >> +       if (gfp & __GFP_NOFAIL)
> >> +               return 0;
> >> +
> >>         ret = try_charge(memcg, gfp, nr_pages);
> >>         if (ret)
> >>                 return ret;
> >> ----------
> > 
> > This should be more likely something like
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 9ec5e12486a7..05a4828edf9d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2820,7 +2820,8 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> >  		return ret;
> >  
> >  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> > -	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> > +	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter) &&
> > +	    !(gfp_mask & __GFP_NOFAIL)) {
> >  		cancel_charge(memcg, nr_pages);
> >  		return -ENOMEM;
> >  	}
> 
> Is it guaranteed that try_charge(__GFP_NOFAIL) never fails?

Yes, it forces the charge in that case.

> >> the second bug that alloc_slabmgmt() in mm/slab.c is returning NULL
> >> when it is a __GFP_NOFAIL allocation request will appear.
> >> I don't know how to handle this.
> > 
> > I am sorry, I do not follow, why would alloc_slabmgmt return NULL
> > with forcing gfp_nofail charges?
> > 
> 
> The reproducer is hitting
> 
> @@ -2300,18 +2302,21 @@ static void *alloc_slabmgmt(struct kmem_cache *cachep,
>  	page->s_mem = addr + colour_off;
>  	page->active = 0;
>  
> -	if (OBJFREELIST_SLAB(cachep))
> +	if (OBJFREELIST_SLAB(cachep)) {
> +		BUG_ON(local_flags & __GFP_NOFAIL); // <= this condition

What is this BUG_ON trying to say, though? I am not an expert on slab, but
only the OFF_SLAB(cachep) branch depends on an allocation. The others should
allocate the object from the cache.
-- 
Michal Hocko
SUSE Labs



* Re: [BUG] kmemcg limit defeats __GFP_NOFAIL allocation
       [not found]         ` <0056063b-46ff-0ebd-ff0d-c96a1f9ae6b1@i-love.sakura.ne.jp>
@ 2019-09-04 14:29           ` Michal Hocko
       [not found]             ` <405ce28b-c0b4-780c-c883-42d741ec60e0@i-love.sakura.ne.jp>
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-09-04 14:29 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: Andrey Ryabinin, Thomas Lindroth, linux-mm

On Wed 04-09-19 23:19:31, Tetsuo Handa wrote:
> On 2019/09/04 20:25, Michal Hocko wrote:
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 9ec5e12486a7..05a4828edf9d 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2820,7 +2820,8 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> >  		return ret;
> >  
> >  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> > -	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> > +	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter) &&
> > +	    !(gfp_mask & __GFP_NOFAIL)) {
> >  		cancel_charge(memcg, nr_pages);
> >  		return -ENOMEM;
> >  	}
> > 
> 
> With s/gfp_mask/gfp/ applied, I get no crash but got below warning.
> I don't know relevance with the patch.

Ohh, right. We are trying to uncharge something that hasn't been charged
because page_counter_try_charge has failed. So the fix needs to be more
involved. Sorry, I should have realized that.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ec5e12486a7..e18108b2b786 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
+
+		/*
+		 * Enforce __GFP_NOFAIL allocation because callers are not
+		 * prepared to see failures and likely do not have any failure
+		 * handling code.
+		 */
+		if (gfp & __GFP_NOFAIL) {
+			page_counter_charge(&memcg->kmem, nr_pages);
+			return 0;
+		}
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}
-- 
Michal Hocko
SUSE Labs



* Re: [BUG] kmemcg limit defeats __GFP_NOFAIL allocation
       [not found]             ` <405ce28b-c0b4-780c-c883-42d741ec60e0@i-love.sakura.ne.jp>
@ 2019-09-05 23:11               ` Thomas Lindroth
  2019-09-06  7:27                 ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Thomas Lindroth @ 2019-09-05 23:11 UTC (permalink / raw)
  To: Tetsuo Handa, Michal Hocko; +Cc: Andrey Ryabinin, linux-mm

On 9/4/19 6:39 PM, Tetsuo Handa wrote:
> On 2019/09/04 23:29, Michal Hocko wrote:
>> Ohh, right. We are trying to uncharge something that hasn't been charged
>> because page_counter_try_charge has failed. So the fix needs to be more
>> involved. Sorry, I should have realized that.
> 
> OK. Survived the test. Thomas, please try.
> 
>> ---
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9ec5e12486a7..e18108b2b786 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>>   
>>   	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>>   	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
>> +
>> +		/*
>> +		 * Enforce __GFP_NOFAIL allocation because callers are not
>> +		 * prepared to see failures and likely do not have any failure
>> +		 * handling code.
>> +		 */
>> +		if (gfp & __GFP_NOFAIL) {
>> +			page_counter_charge(&memcg->kmem, nr_pages);
>> +			return 0;
>> +		}
>>   		cancel_charge(memcg, nr_pages);
>>   		return -ENOMEM;
>>   	}
>>

I tried the patch with 5.2.11 and wasn't able to trigger any null pointer
deref crashes with it. Testing is tricky because the OOM killer will still
run and eventually kill bash and whatever runs in the cgroup.

I backported the patch to 4.19.69 and ran the chromium build like before
but this time I couldn't trigger any system crashes.



* Re: [BUG] kmemcg limit defeats __GFP_NOFAIL allocation
  2019-09-05 23:11               ` Thomas Lindroth
@ 2019-09-06  7:27                 ` Michal Hocko
  2019-09-06 10:54                   ` Andrey Ryabinin
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-09-06  7:27 UTC (permalink / raw)
  To: Thomas Lindroth; +Cc: Tetsuo Handa, Andrey Ryabinin, linux-mm

On Fri 06-09-19 01:11:53, Thomas Lindroth wrote:
> On 9/4/19 6:39 PM, Tetsuo Handa wrote:
> > On 2019/09/04 23:29, Michal Hocko wrote:
> > > Ohh, right. We are trying to uncharge something that hasn't been charged
> > > because page_counter_try_charge has failed. So the fix needs to be more
> > > involved. Sorry, I should have realized that.
> > 
> > OK. Survived the test. Thomas, please try.
> > 
> > > ---
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 9ec5e12486a7..e18108b2b786 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> > >   	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> > >   	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> > > +
> > > +		/*
> > > +		 * Enforce __GFP_NOFAIL allocation because callers are not
> > > +		 * prepared to see failures and likely do not have any failure
> > > +		 * handling code.
> > > +		 */
> > > +		if (gfp & __GFP_NOFAIL) {
> > > +			page_counter_charge(&memcg->kmem, nr_pages);
> > > +			return 0;
> > > +		}
> > >   		cancel_charge(memcg, nr_pages);
> > >   		return -ENOMEM;
> > >   	}
> > > 
> 
> I tried the patch with 5.2.11 and wasn't able to trigger any null pointer
> deref crashes with it. Testing is tricky because the OOM killer will still
> run and eventually kill bash and whatever runs in the cgroup.

Yeah, this is unfortunate but also unfixable I am afraid. I will post an
official patch with a changelog later today.

Thanks for testing!
-- 
Michal Hocko
SUSE Labs



* Re: [BUG] kmemcg limit defeats __GFP_NOFAIL allocation
  2019-09-06  7:27                 ` Michal Hocko
@ 2019-09-06 10:54                   ` Andrey Ryabinin
  2019-09-06 11:29                     ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Andrey Ryabinin @ 2019-09-06 10:54 UTC (permalink / raw)
  To: Michal Hocko, Thomas Lindroth; +Cc: Tetsuo Handa, Andrey Ryabinin, linux-mm



On 9/6/19 10:27 AM, Michal Hocko wrote:
> On Fri 06-09-19 01:11:53, Thomas Lindroth wrote:
>> On 9/4/19 6:39 PM, Tetsuo Handa wrote:
>>> On 2019/09/04 23:29, Michal Hocko wrote:
>>>> Ohh, right. We are trying to uncharge something that hasn't been charged
>>>> because page_counter_try_charge has failed. So the fix needs to be more
>>>> involved. Sorry, I should have realized that.
>>>
>>> OK. Survived the test. Thomas, please try.
>>>
>>>> ---
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index 9ec5e12486a7..e18108b2b786 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>>>>   	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>>>>   	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
>>>> +
>>>> +		/*
>>>> +		 * Enforce __GFP_NOFAIL allocation because callers are not
>>>> +		 * prepared to see failures and likely do not have any failure
>>>> +		 * handling code.
>>>> +		 */
>>>> +		if (gfp & __GFP_NOFAIL) {
>>>> +			page_counter_charge(&memcg->kmem, nr_pages);
>>>> +			return 0;
>>>> +		}
>>>>   		cancel_charge(memcg, nr_pages);
>>>>   		return -ENOMEM;
>>>>   	}
>>>>
>>
>> I tried the patch with 5.2.11 and wasn't able to trigger any null pointer
>> deref crashes with it. Testing is tricky because the OOM killer will still
>> run and eventually kill bash and whatever runs in the cgroup.
> 
> Yeah, this is unfortunate but also unfixable I am afraid. 

I think there are two possible ways to fix this. If we decide to keep kmem.limit_in_bytes broken,
then we can just always bypass the limit. We could also add something like pr_warn_once("kmem limit doesn't work");
when the user changes kmem.limit_in_bytes.


Or we can fix kmem.limit_in_bytes like this:


---
 mm/memcontrol.c | 76 +++++++++++++++++++++++++++++++------------------
 1 file changed, 48 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d1d598d9554..71b9065e4b31 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1314,7 +1314,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg, bool kmem)
 {
 	unsigned long margin = 0;
 	unsigned long count;
@@ -1334,6 +1334,15 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 			margin = 0;
 	}
 
+	if (kmem && margin) {
+		count = page_counter_read(&memcg->kmem);
+		limit = READ_ONCE(memcg->kmem.max);
+		if (count <= limit)
+			margin = min(margin, limit - count);
+		else
+			margin = 0;
+	}
+
 	return margin;
 }
 
@@ -2505,7 +2514,7 @@ void mem_cgroup_handle_over_high(void)
 }
 
 static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-		      unsigned int nr_pages)
+		      unsigned int nr_pages, bool kmem_charge)
 {
 	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -2519,21 +2528,42 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_is_root(memcg))
 		return 0;
 retry:
-	if (consume_stock(memcg, nr_pages))
+	if (consume_stock(memcg, nr_pages)) {
+		if (kmem_charge && !page_counter_try_charge(&memcg->kmem,
+							nr_pages, &counter)) {
+			refill_stock(memcg, nr_pages);
+			goto charge;
+		}
 		return 0;
+	}
 
+charge:
+	mem_over_limit = NULL;
 	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
-		if (page_counter_try_charge(&memcg->memory, batch, &counter))
-			goto done_restock;
-		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, batch);
-		mem_over_limit = mem_cgroup_from_counter(counter, memory);
+		if (!page_counter_try_charge(&memcg->memory, batch, &counter)) {
+			if (do_memsw_account())
+				page_counter_uncharge(&memcg->memsw, batch);
+			mem_over_limit = mem_cgroup_from_counter(counter, memory);
+		}
 	} else {
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
 		may_swap = false;
 	}
 
+	if (!mem_over_limit && kmem_charge) {
+		if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
+			may_swap = false;
+			mem_over_limit = mem_cgroup_from_counter(counter, kmem);
+			page_counter_uncharge(&memcg->memory, batch);
+			if (do_memsw_account())
+				page_counter_uncharge(&memcg->memsw, batch);
+		}
+	}
+
+	if (!mem_over_limit)
+		goto done_restock;
+
 	if (batch > nr_pages) {
 		batch = nr_pages;
 		goto retry;
@@ -2568,7 +2598,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
 						    gfp_mask, may_swap);
 
-	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
+	if (mem_cgroup_margin(mem_over_limit, kmem_charge) >= nr_pages)
 		goto retry;
 
 	if (!drained) {
@@ -2637,6 +2667,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	page_counter_charge(&memcg->memory, nr_pages);
 	if (do_memsw_account())
 		page_counter_charge(&memcg->memsw, nr_pages);
+	if (kmem_charge)
+		page_counter_charge(&memcg->kmem, nr_pages);
 	css_get_many(&memcg->css, nr_pages);
 
 	return 0;
@@ -2943,20 +2975,8 @@ void memcg_kmem_put_cache(struct kmem_cache *cachep)
 int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
 			    struct mem_cgroup *memcg)
 {
-	unsigned int nr_pages = 1 << order;
-	struct page_counter *counter;
-	int ret;
-
-	ret = try_charge(memcg, gfp, nr_pages);
-	if (ret)
-		return ret;
-
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
-	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
-		cancel_charge(memcg, nr_pages);
-		return -ENOMEM;
-	}
-	return 0;
+	return try_charge(memcg, gfp, 1 << order,
+			!cgroup_subsys_on_dfl(memory_cgrp_subsys));
 }
 
 /**
@@ -5053,7 +5073,7 @@ static int mem_cgroup_do_precharge(unsigned long count)
 	int ret;
 
 	/* Try a single bulk charge without reclaim first, kswapd may wake */
-	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
+	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count, false);
 	if (!ret) {
 		mc.precharge += count;
 		return ret;
@@ -5061,7 +5081,7 @@ static int mem_cgroup_do_precharge(unsigned long count)
 
 	/* Try charges one by one with reclaim, but do not retry */
 	while (count--) {
-		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
+		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1, false);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -6255,7 +6275,7 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
 	if (!memcg)
 		memcg = get_mem_cgroup_from_mm(mm);
 
-	ret = try_charge(memcg, gfp_mask, nr_pages);
+	ret = try_charge(memcg, gfp_mask, nr_pages, false);
 
 	css_put(&memcg->css);
 out:
@@ -6634,10 +6654,10 @@ bool mem_cgroup_charge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
 
-	if (try_charge(memcg, gfp_mask, nr_pages) == 0)
+	if (try_charge(memcg, gfp_mask, nr_pages, false) == 0)
 		return true;
 
-	try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages);
+	try_charge(memcg, gfp_mask|__GFP_NOFAIL, nr_pages, false);
 	return false;
 }
 
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [BUG] kmemcg limit defeats __GFP_NOFAIL allocation
  2019-09-06 10:54                   ` Andrey Ryabinin
@ 2019-09-06 11:29                     ` Michal Hocko
  0 siblings, 0 replies; 26+ messages in thread
From: Michal Hocko @ 2019-09-06 11:29 UTC (permalink / raw)
  To: Andrey Ryabinin; +Cc: Thomas Lindroth, Tetsuo Handa, linux-mm

On Fri 06-09-19 13:54:30, Andrey Ryabinin wrote:
> 
> 
> On 9/6/19 10:27 AM, Michal Hocko wrote:
> > On Fri 06-09-19 01:11:53, Thomas Lindroth wrote:
> >> On 9/4/19 6:39 PM, Tetsuo Handa wrote:
> >>> On 2019/09/04 23:29, Michal Hocko wrote:
> >>>> Ohh, right. We are trying to uncharge something that hasn't been charged
> >>>> because page_counter_try_charge has failed. So the fix needs to be more
> >>>> involved. Sorry, I should have realized that.
> >>>
> >>> OK. Survived the test. Thomas, please try.
> >>>
> >>>> ---
> >>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >>>> index 9ec5e12486a7..e18108b2b786 100644
> >>>> --- a/mm/memcontrol.c
> >>>> +++ b/mm/memcontrol.c
> >>>> @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
> >>>>   	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
> >>>>   	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> >>>> +
> >>>> +		/*
> >>>> +		 * Enforce __GFP_NOFAIL allocation because callers are not
> >>>> +		 * prepared to see failures and likely do not have any failure
> >>>> +		 * handling code.
> >>>> +		 */
> >>>> +		if (gfp & __GFP_NOFAIL) {
> >>>> +			page_counter_charge(&memcg->kmem, nr_pages);
> >>>> +			return 0;
> >>>> +		}
> >>>>   		cancel_charge(memcg, nr_pages);
> >>>>   		return -ENOMEM;
> >>>>   	}
> >>>>
> >>
> >> I tried the patch with 5.2.11 and wasn't able to trigger any null pointer
> >> deref crashes with it. Testing is tricky because the OOM killer will still
> >> run and eventually kill bash and whatever runs in the cgroup.
> > 
> > Yeah, this is unfortunate but also unfixable I am afraid. 
> 
> I think there are two possible ways to fix this. If we decide to keep kmem.limit_in_bytes broken,
> then we can just always bypass the limit. Also we could add something like pr_warn_once("kmem limit doesn't work");
> when the user changes kmem.limit_in_bytes
> 
> 
> Or we can fix kmem.limit_in_bytes like this:

I would rather state the brokenness in the documentation. I do not want
to make the code more complex. I have only glanced through your patch but
its sheer size is really discouraging. Besides that, the issue is really
not fixable because kmem charges are simply never going to be guaranteed
to be reclaimable and we simply cannot invoke the memcg OOM killer to
resolve the problem. Having a separate counter was just a bad design
choice :/
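
To illustrate the separate-counter problem, here is a minimal userspace
sketch (not kernel code). It mirrors the min() logic of the proposed
mem_cgroup_margin() change; struct toy_counter, toy_margin_one() and
toy_margin() are hypothetical names, and hierarchy and locking are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page counter: usage and limit in pages. */
struct toy_counter {
	unsigned long usage;
	unsigned long max;
};

/* Pages left before this counter hits its limit (0 if at or over it). */
static unsigned long toy_margin_one(const struct toy_counter *c)
{
	assert(c);
	return c->usage <= c->max ? c->max - c->usage : 0;
}

/*
 * Room left for a charge. For a kmem charge the smaller of the two
 * margins wins, so a full kmem counter blocks the charge no matter
 * how much room the memory counter still has.
 */
static unsigned long toy_margin(const struct toy_counter *memory,
				const struct toy_counter *kmem,
				bool kmem_charge)
{
	unsigned long margin = toy_margin_one(memory);

	if (kmem_charge && margin) {
		unsigned long kmem_margin = toy_margin_one(kmem);

		if (kmem_margin < margin)
			margin = kmem_margin;
	}
	return margin;
}
```

With the limits from the original report expressed in 4 KiB pages (memory
max 3145728, kmem max 262144) and a made-up kmem usage equal to its limit,
toy_margin() returns 0 for a kmem charge even though the memory counter
alone still has room.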
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
       [not found] ` <20190906125608.32129-1-mhocko@kernel.org>
@ 2019-09-06 18:24   ` Shakeel Butt
  2019-09-09 11:22     ` Michal Hocko
  2019-09-24 10:53   ` Michal Hocko
  1 sibling, 1 reply; 26+ messages in thread
From: Shakeel Butt @ 2019-09-06 18:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov, LKML, Linux MM,
	Andrey Ryabinin, Michal Hocko, Thomas Lindroth, Tetsuo Handa

On Fri, Sep 6, 2019 at 5:56 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> From: Michal Hocko <mhocko@suse.com>
>
> Thomas has noticed the following NULL ptr dereference when using cgroup
> v1 kmem limit:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> PGD 0
> P4D 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
> Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
> RIP: 0010:create_empty_buffers+0x24/0x100
> Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
> RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
> RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
> RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
> FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
> Call Trace:
>  create_page_buffers+0x4d/0x60
>  __block_write_begin_int+0x8e/0x5a0
>  ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
>  ? jbd2__journal_start+0xd7/0x1f0
>  ext4_da_write_begin+0x112/0x3d0
>  generic_perform_write+0xf1/0x1b0
>  ? file_update_time+0x70/0x140
>  __generic_file_write_iter+0x141/0x1a0
>  ext4_file_write_iter+0xef/0x3b0
>  __vfs_write+0x17e/0x1e0
>  vfs_write+0xa5/0x1a0
>  ksys_write+0x57/0xd0
>  do_syscall_64+0x55/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>  Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
>  fails __GFP_NOFAIL charge when the kmem limit is reached. This is a
>  wrong behavior because nofail allocations are not allowed to fail.
>  Normal charge path simply forces the charge even if that means to cross
>  the limit. Kmem accounting should be doing the same.
>
> Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
> Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I wonder what has changed since
<http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.

> ---
>  mm/memcontrol.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9ec5e12486a7..e18108b2b786 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>
>         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>             !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> +
> +               /*
> +                * Enforce __GFP_NOFAIL allocation because callers are not
> +                * prepared to see failures and likely do not have any failure
> +                * handling code.
> +                */
> +               if (gfp & __GFP_NOFAIL) {
> +                       page_counter_charge(&memcg->kmem, nr_pages);
> +                       return 0;
> +               }
>                 cancel_charge(memcg, nr_pages);
>                 return -ENOMEM;
>         }
> --
> 2.20.1
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
  2019-09-06 18:24   ` [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges Shakeel Butt
@ 2019-09-09 11:22     ` Michal Hocko
  2019-09-11 12:00       ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-09-09 11:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov, LKML, Linux MM,
	Andrey Ryabinin, Thomas Lindroth, Tetsuo Handa

On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> On Fri, Sep 6, 2019 at 5:56 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Thomas has noticed the following NULL ptr dereference when using cgroup
> > v1 kmem limit:
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> > PGD 0
> > P4D 0
> > Oops: 0000 [#1] PREEMPT SMP PTI
> > CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
> > Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
> > RIP: 0010:create_empty_buffers+0x24/0x100
> > Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
> > RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
> > RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
> > RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
> > RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
> > R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
> > R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
> > FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
> > Call Trace:
> >  create_page_buffers+0x4d/0x60
> >  __block_write_begin_int+0x8e/0x5a0
> >  ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
> >  ? jbd2__journal_start+0xd7/0x1f0
> >  ext4_da_write_begin+0x112/0x3d0
> >  generic_perform_write+0xf1/0x1b0
> >  ? file_update_time+0x70/0x140
> >  __generic_file_write_iter+0x141/0x1a0
> >  ext4_file_write_iter+0xef/0x3b0
> >  __vfs_write+0x17e/0x1e0
> >  vfs_write+0xa5/0x1a0
> >  ksys_write+0x57/0xd0
> >  do_syscall_64+0x55/0x160
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> >  Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
> >  fails __GFP_NOFAIL charge when the kmem limit is reached. This is a
> >  wrong behavior because nofail allocations are not allowed to fail.
> >  Normal charge path simply forces the charge even if that means to cross
> >  the limit. Kmem accounting should be doing the same.
> >
> > Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
> > Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> > Cc: stable
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> I wonder what has changed since
> <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.

I have completely forgotten about that one. It seems that we have just
repeated the same discussion again. This time we have a poor user who
actually enabled the kmem limit.

I guess there was no real objection to the change back then. The primary
discussion revolved around the fact that the accounting will stay broken
even when this particular part is fixed. Considering that this leads to
an easy-to-trigger crash (with the limit enabled), I guess we should just
make it less broken, backport it to stable trees, and have a serious
discussion about discontinuing the limit. We could start by simply
failing to set any limit in the current upstream kernels.
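
For anyone following along, the behaviour the patch aims for can be
modelled in userspace roughly as below. This is only a sketch: struct
toy_counter and the toy_* helpers are hypothetical stand-ins for the
kernel's page counter API, and the nofail flag stands in for
(gfp & __GFP_NOFAIL).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page counter: usage and limit in pages. */
struct toy_counter {
	unsigned long usage;
	unsigned long max;
};

/* Charge only if the result stays within the limit. */
static bool toy_try_charge(struct toy_counter *c, unsigned long nr_pages)
{
	assert(c);
	if (c->usage + nr_pages > c->max)
		return false;
	c->usage += nr_pages;
	return true;
}

/* Unconditional charge; usage may end up above the limit. */
static void toy_charge(struct toy_counter *c, unsigned long nr_pages)
{
	assert(c);
	c->usage += nr_pages;
}

/*
 * Model of the kmem charge with the fix applied: a nofail request is
 * forced over the limit instead of returning -ENOMEM to a caller that
 * has no failure handling.
 */
static int toy_kmem_charge(struct toy_counter *kmem, unsigned long nr_pages,
			   bool nofail)
{
	if (toy_try_charge(kmem, nr_pages))
		return 0;
	if (nofail) {
		toy_charge(kmem, nr_pages);
		return 0;
	}
	return -ENOMEM;
}
```

An ordinary charge still fails once the limit is reached; only the nofail
case crosses it, which matches what the normal charge path already does.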

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
  2019-09-09 11:22     ` Michal Hocko
@ 2019-09-11 12:00       ` Michal Hocko
  2019-09-11 14:37         ` Andrew Morton
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-09-11 12:00 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov, LKML, Linux MM,
	Andrey Ryabinin, Thomas Lindroth, Tetsuo Handa

On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
[...]
> > I wonder what has changed since
> > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> 
> I have completely forgot about that one. It seems that we have just
> repeated the same discussion again. This time we have a poor user who
> actually enabled the kmem limit.
> 
> I guess there was no real objection to the change back then. The primary
> discussion revolved around the fact that the accounting will stay broken
> even when this particular part was fixed. Considering this leads to easy
> to trigger crash (with the limit enabled) then I guess we should just
> make it less broken and backport to stable trees and have a serious
> discussion about discontinuing of the limit. Start by simply failing to
> set any limit in the current upstream kernels.

Any more concerns/objections to the patch? I can add a reference to your
earlier post Shakeel if you want or to credit you the way you prefer.

Also are there any objections to starting the deprecation process for the
kmem limit? I would see it in three stages
- 1st warn in the kernel log
	pr_warn("kmem.limit_in_bytes is deprecated and will be removed. "
	        "Please report your usecase to linux-mm@kvack.org if you "
		"depend on this functionality.\n");
- 2nd fail any write to kmem.limit_in_bytes
- 3rd remove the control file completely
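
A rough userspace model of the staged behaviour, purely as a sketch: the
enum, struct toy_knob and toy_kmem_limit_write() are hypothetical names,
and the error codes for the later stages (-EINVAL, -ENOENT) are
assumptions rather than anything decided here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

enum deprecation_stage {
	STAGE_WARN = 1,		/* accept the write, warn once */
	STAGE_REJECT_WRITES,	/* fail any write */
	STAGE_REMOVED,		/* control file no longer exists */
};

/* Hypothetical state behind the kmem.limit_in_bytes knob. */
struct toy_knob {
	bool warned;
	unsigned long limit;
};

static int toy_kmem_limit_write(struct toy_knob *knob,
				enum deprecation_stage stage,
				unsigned long nr_pages)
{
	assert(knob);

	switch (stage) {
	case STAGE_WARN:
		/* Print the warning on the first write only. */
		if (!knob->warned) {
			knob->warned = true;
			fprintf(stderr, "kmem.limit_in_bytes is deprecated "
					"and will be removed.\n");
		}
		knob->limit = nr_pages;
		return 0;
	case STAGE_REJECT_WRITES:
		return -EINVAL;	/* assumed error code */
	case STAGE_REMOVED:
	default:
		return -ENOENT;	/* assumed: the file is gone */
	}
}
```

Only the first stage changes the stored limit; the later stages leave it
untouched and report an error to the writer.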
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
  2019-09-11 12:00       ` Michal Hocko
@ 2019-09-11 14:37         ` Andrew Morton
  2019-09-11 15:16           ` Michal Hocko
  0 siblings, 1 reply; 26+ messages in thread
From: Andrew Morton @ 2019-09-11 14:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Shakeel Butt, Johannes Weiner, Vladimir Davydov, LKML, Linux MM,
	Andrey Ryabinin, Thomas Lindroth, Tetsuo Handa

On Wed, 11 Sep 2019 14:00:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> > On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> [...]
> > > I wonder what has changed since
> > > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> > 
> > I have completely forgot about that one. It seems that we have just
> > repeated the same discussion again. This time we have a poor user who
> > actually enabled the kmem limit.
> > 
> > I guess there was no real objection to the change back then. The primary
> > discussion revolved around the fact that the accounting will stay broken
> > even when this particular part was fixed. Considering this leads to easy
> > to trigger crash (with the limit enabled) then I guess we should just
> > make it less broken and backport to stable trees and have a serious
> > discussion about discontinuing of the limit. Start by simply failing to
> > set any limit in the current upstream kernels.
> 
> Any more concerns/objections to the patch? I can add a reference to your
> earlier post Shakeel if you want or to credit you the way you prefer.
> 
> Also are there any objections to start deprecating process of kmem
> limit? I would see it in two stages
> - 1st warn in the kernel log
> 	pr_warn("kmem.limit_in_bytes is deprecated and will be removed.
> 	        "Please report your usecase to linux-mm@kvack.org if you "
> 		"depend on this functionality."

pr_warn_once() :)

> - 2nd fail any write to kmem.limit_in_bytes
> - 3rd remove the control file completely

Sounds good to me.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
  2019-09-11 14:37         ` Andrew Morton
@ 2019-09-11 15:16           ` Michal Hocko
  2019-09-13  2:46             ` Shakeel Butt
  0 siblings, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-09-11 15:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shakeel Butt, Johannes Weiner, Vladimir Davydov, LKML, Linux MM,
	Andrey Ryabinin, Thomas Lindroth, Tetsuo Handa

On Wed 11-09-19 07:37:40, Andrew Morton wrote:
> On Wed, 11 Sep 2019 14:00:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> > > On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> > [...]
> > > > I wonder what has changed since
> > > > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> > > 
> > > I have completely forgot about that one. It seems that we have just
> > > repeated the same discussion again. This time we have a poor user who
> > > actually enabled the kmem limit.
> > > 
> > > I guess there was no real objection to the change back then. The primary
> > > discussion revolved around the fact that the accounting will stay broken
> > > even when this particular part was fixed. Considering this leads to easy
> > > to trigger crash (with the limit enabled) then I guess we should just
> > > make it less broken and backport to stable trees and have a serious
> > > discussion about discontinuing of the limit. Start by simply failing to
> > > set any limit in the current upstream kernels.
> > 
> > Any more concerns/objections to the patch? I can add a reference to your
> > earlier post Shakeel if you want or to credit you the way you prefer.
> > 
> > Also are there any objections to start deprecating process of kmem
> > limit? I would see it in two stages
> > - 1st warn in the kernel log
> > 	pr_warn("kmem.limit_in_bytes is deprecated and will be removed.
> > 	        "Please report your usecase to linux-mm@kvack.org if you "
> > 		"depend on this functionality."
> 
> pr_warn_once() :)
> 
> > - 2nd fail any write to kmem.limit_in_bytes
> > - 3rd remove the control file completely
> 
> Sounds good to me.

Here we go

From 512822e551fe2960040c23b12c7b27a5fdab9013 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 11 Sep 2019 17:02:33 +0200
Subject: [PATCH] memcg, kmem: deprecate kmem.limit_in_bytes

Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
which turned out to be a really bad idea because there are paths which
cannot shrink the kernel memory usage enough to get below the limit
(e.g. because the accounted memory is not reclaimable). There are cases
where failure is not even allowed (e.g. __GFP_NOFAIL). This means
that the kmem limit is in excess to the hard limit without any way to
shrink and thus completely useless. OOM killer cannot be invoked to
handle the situation because that would lead to a premature oom killing.

As a result many places might see ENOMEM returning from kmalloc and
result in unexpected errors. E.g. a global OOM killer when there is a
lot of free memory because ENOMEM is translated into VM_FAULT_OOM in #PF
path and therefore pagefault_out_of_memory would result in OOM killer.

Please note that the kernel memory is still accounted to the overall
limit along with the user memory so removing the kmem specific limit
should still allow to contain kernel memory consumption. Unlike the kmem
one, though, it invokes memory reclaim and targeted memcg oom killing if
necessary.

Start the deprecation process by crying to the kernel log. Let's see
whether there are relevant usecases and simply return EINVAL in the
second stage if nobody complains in a few releases.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 3 +++
 mm/memcontrol.c                                | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 41bdc038dad9..e53fc2f31549 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -87,6 +87,9 @@ Brief summary of control files.
 				     node
 
  memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
+                                     This knob is deprecated it shouldn't be
+                                     used. It is planned to be removed in
+                                     a foreseeable future.
  memory.kmem.usage_in_bytes          show current kernel memory allocation
  memory.kmem.failcnt                 show the number of kernel memory usage
 				     hits limits
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e18108b2b786..113969bc57e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3518,6 +3518,9 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
 			ret = mem_cgroup_resize_max(memcg, nr_pages, true);
 			break;
 		case _KMEM:
+			pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+				     "Please report your usecase to linux-mm@kvack.org if you "
+				     "depend on this functionality.\n");
 			ret = memcg_update_kmem_max(memcg, nr_pages);
 			break;
 		case _TCP:
-- 
2.20.1


-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
  2019-09-11 15:16           ` Michal Hocko
@ 2019-09-13  2:46             ` Shakeel Butt
  0 siblings, 0 replies; 26+ messages in thread
From: Shakeel Butt @ 2019-09-13  2:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Vladimir Davydov, LKML, Linux MM,
	Andrey Ryabinin, Thomas Lindroth, Tetsuo Handa

On Wed, Sep 11, 2019 at 8:16 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 11-09-19 07:37:40, Andrew Morton wrote:
> > On Wed, 11 Sep 2019 14:00:02 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> >
> > > On Mon 09-09-19 13:22:45, Michal Hocko wrote:
> > > > On Fri 06-09-19 11:24:55, Shakeel Butt wrote:
> > > [...]
> > > > > I wonder what has changed since
> > > > > <http://lkml.kernel.org/r/20180525185501.82098-1-shakeelb@google.com/>.
> > > >
> > > > I have completely forgot about that one. It seems that we have just
> > > > repeated the same discussion again. This time we have a poor user who
> > > > actually enabled the kmem limit.
> > > >
> > > > I guess there was no real objection to the change back then. The primary
> > > > discussion revolved around the fact that the accounting will stay broken
> > > > even when this particular part was fixed. Considering this leads to easy
> > > > to trigger crash (with the limit enabled) then I guess we should just
> > > > make it less broken and backport to stable trees and have a serious
> > > > discussion about discontinuing of the limit. Start by simply failing to
> > > > set any limit in the current upstream kernels.
> > >
> > > Any more concerns/objections to the patch? I can add a reference to your
> > > earlier post Shakeel if you want or to credit you the way you prefer.
> > >
> > > Also are there any objections to start deprecating process of kmem
> > > limit? I would see it in two stages
> > > - 1st warn in the kernel log
> > >     pr_warn("kmem.limit_in_bytes is deprecated and will be removed.
> > >             "Please report your usecase to linux-mm@kvack.org if you "
> > >             "depend on this functionality."
> >
> > pr_warn_once() :)
> >
> > > - 2nd fail any write to kmem.limit_in_bytes
> > > - 3rd remove the control file completely
> >
> > Sounds good to me.
>
> Here we go
>
> From 512822e551fe2960040c23b12c7b27a5fdab9013 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 11 Sep 2019 17:02:33 +0200
> Subject: [PATCH] memcg, kmem: deprecate kmem.limit_in_bytes
>
> Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
> which turned out to be really a bad idea because there are paths which
> cannot shrink the kernel memory usage enough to get below the limit
> (e.g. because the accounted memory is not reclaimable). There are cases
> when the failure is even not allowed (e.g. __GFP_NOFAIL). This means
> that the kmem limit is in excess to the hard limit without any way to
> shrink and thus completely useless. OOM killer cannot be invoked to
> handle the situation because that would lead to a premature oom killing.
>
> As a result many places might see ENOMEM returning from kmalloc and
> result in unexpected errors. E.g. a global OOM killer when there is a
> lot of free memory because ENOMEM is translated into VM_FAULT_OOM in #PF
> path and therefore pagefault_out_of_memory would result in OOM killer.
>
> Please note that the kernel memory is still accounted to the overall
> limit along with the user memory so removing the kmem specific limit
> should still allow to contain kernel memory consumption. Unlike the kmem
> one, though, it invokes memory reclaim and targeted memcg oom killing if
> necessary.
>
> Start the deprecation process by crying to the kernel log. Let's see
> whether there are relevant usecases and simply return to EINVAL in the
> second stage if nobody complains in few releases.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> ---
>  Documentation/admin-guide/cgroup-v1/memory.rst | 3 +++
>  mm/memcontrol.c                                | 3 +++
>  2 files changed, 6 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 41bdc038dad9..e53fc2f31549 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -87,6 +87,9 @@ Brief summary of control files.
>                                      node
>
>   memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
> +                                     This knob is deprecated it shouldn't be
> +                                     used. It is planned to be removed in
> +                                     a foreseeable future.
>   memory.kmem.usage_in_bytes          show current kernel memory allocation
>   memory.kmem.failcnt                 show the number of kernel memory usage
>                                      hits limits
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e18108b2b786..113969bc57e8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3518,6 +3518,9 @@ static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
>                         ret = mem_cgroup_resize_max(memcg, nr_pages, true);
>                         break;
>                 case _KMEM:
> +                       pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
> +                                    "Please report your usecase to linux-mm@kvack.org if you "
> +                                    "depend on this functionality.\n");
>                         ret = memcg_update_kmem_max(memcg, nr_pages);
>                         break;
>                 case _TCP:
> --
> 2.20.1
>
>
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
       [not found] ` <20190906125608.32129-1-mhocko@kernel.org>
  2019-09-06 18:24   ` [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges Shakeel Butt
@ 2019-09-24 10:53   ` Michal Hocko
  2019-09-24 23:06     ` Andrew Morton
  1 sibling, 1 reply; 26+ messages in thread
From: Michal Hocko @ 2019-09-24 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vladimir Davydov, LKML, linux-mm,
	Andrey Ryabinin, Thomas Lindroth, Tetsuo Handa

Andrew, do you plan to send this patch to Linus as well?

On Fri 06-09-19 14:56:08, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Thomas has noticed the following NULL ptr dereference when using cgroup
> v1 kmem limit:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> PGD 0
> P4D 0
> Oops: 0000 [#1] PREEMPT SMP PTI
> CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
> Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
> RIP: 0010:create_empty_buffers+0x24/0x100
> Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
> RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
> RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
> RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
> RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
> R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
> R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
> FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
> Call Trace:
>  create_page_buffers+0x4d/0x60
>  __block_write_begin_int+0x8e/0x5a0
>  ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
>  ? jbd2__journal_start+0xd7/0x1f0
>  ext4_da_write_begin+0x112/0x3d0
>  generic_perform_write+0xf1/0x1b0
>  ? file_update_time+0x70/0x140
>  __generic_file_write_iter+0x141/0x1a0
>  ext4_file_write_iter+0xef/0x3b0
>  __vfs_write+0x17e/0x1e0
>  vfs_write+0xa5/0x1a0
>  ksys_write+0x57/0xd0
>  do_syscall_64+0x55/0x160
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
>  Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
>  fails __GFP_NOFAIL charge when the kmem limit is reached. This is a
>  wrong behavior because nofail allocations are not allowed to fail.
>  Normal charge path simply forces the charge even if that means to cross
>  the limit. Kmem accounting should be doing the same.
> 
> Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
> Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Cc: stable
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memcontrol.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9ec5e12486a7..e18108b2b786 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2821,6 +2821,16 @@ int __memcg_kmem_charge_memcg(struct page *page, gfp_t gfp, int order,
>  
>  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
>  	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
> +
> +		/*
> +		 * Enforce __GFP_NOFAIL allocation because callers are not
> +		 * prepared to see failures and likely do not have any failure
> +		 * handling code.
> +		 */
> +		if (gfp & __GFP_NOFAIL) {
> +			page_counter_charge(&memcg->kmem, nr_pages);
> +			return 0;
> +		}
>  		cancel_charge(memcg, nr_pages);
>  		return -ENOMEM;
>  	}
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges
  2019-09-24 10:53   ` Michal Hocko
@ 2019-09-24 23:06     ` Andrew Morton
  0 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2019-09-24 23:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Vladimir Davydov, LKML, linux-mm,
	Andrey Ryabinin, Thomas Lindroth, Tetsuo Handa

On Tue, 24 Sep 2019 12:53:55 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> Andrew, do you plan to send this patch to Linus as well?

I suppose so.  We don't actually have any reviewed-bys or acked-bys but
I expect that's because Shakeel forgot to type them in.

The followup deprecation warning patch I parked for 5.4-rc1.  Best to
give it a spin in -next and see if anyone complains before we go
bothering mainline users.


From: Michal Hocko <mhocko@suse.com>
Subject: memcg, kmem: do not fail __GFP_NOFAIL charges

Thomas has noticed the following NULL ptr dereference when using cgroup
v1 kmem limit:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
PGD 0
P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
RIP: 0010:create_empty_buffers+0x24/0x100
Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
Call Trace:
 create_page_buffers+0x4d/0x60
 __block_write_begin_int+0x8e/0x5a0
 ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
 ? jbd2__journal_start+0xd7/0x1f0
 ext4_da_write_begin+0x112/0x3d0
 generic_perform_write+0xf1/0x1b0
 ? file_update_time+0x70/0x140
 __generic_file_write_iter+0x141/0x1a0
 ext4_file_write_iter+0xef/0x3b0
 __vfs_write+0x17e/0x1e0
 vfs_write+0xa5/0x1a0
 ksys_write+0x57/0xd0
 do_syscall_64+0x55/0x160
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Tetsuo then noticed that this is because __memcg_kmem_charge_memcg fails
__GFP_NOFAIL charges when the kmem limit is reached.  This is wrong
behavior because nofail allocations are not allowed to fail.  The normal
charge path simply forces the charge even if that means crossing the
limit.  Kmem accounting should do the same.

Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/mm/memcontrol.c~memcg-kmem-do-not-fail-__gfp_nofail-charges
+++ a/mm/memcontrol.c
@@ -2943,6 +2943,16 @@ int __memcg_kmem_charge_memcg(struct pag
 
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
 	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
+
+		/*
+		 * Enforce __GFP_NOFAIL allocation because callers are not
+		 * prepared to see failures and likely do not have any failure
+		 * handling code.
+		 */
+		if (gfp & __GFP_NOFAIL) {
+			page_counter_charge(&memcg->kmem, nr_pages);
+			return 0;
+		}
 		cancel_charge(memcg, nr_pages);
 		return -ENOMEM;
 	}
_
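
The control flow added by the hunk above can be modelled in user space.
The following is only a minimal sketch, assuming a simplified counter
with a usage and a limit; it is not the kernel's page_counter
implementation.  All names (counter_sketch, sketch_try_charge,
sketch_charge, sketch_kmem_charge, SKETCH_ENOMEM) are hypothetical and
exist only for illustration, and the sketch leaves out the
cancel_charge() step and the cgroup v1 check from the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_ENOMEM 12	/* stand-in for the kernel's ENOMEM */

/* Hypothetical, simplified stand-in for a kmem counter. */
struct counter_sketch {
	long usage;	/* pages currently charged */
	long max;	/* limit in pages */
};

/* Charge only if the limit is respected; mirrors a "try" charge. */
static bool sketch_try_charge(struct counter_sketch *c, long nr_pages)
{
	if (c->usage + nr_pages > c->max)
		return false;
	c->usage += nr_pages;
	return true;
}

/* Unconditional charge; usage may exceed the limit. */
static void sketch_charge(struct counter_sketch *c, long nr_pages)
{
	c->usage += nr_pages;
}

/*
 * Models the patched decision: when the limited charge fails, a
 * "nofail" request is forced through instead of returning an error.
 */
static int sketch_kmem_charge(struct counter_sketch *kmem, long nr_pages,
			      bool nofail)
{
	if (!sketch_try_charge(kmem, nr_pages)) {
		if (nofail) {
			sketch_charge(kmem, nr_pages);
			return 0;
		}
		return -SKETCH_ENOMEM;
	}
	return 0;
}

/* Small demonstration; call it from main() to see the three cases. */
static void sketch_demo(void)
{
	struct counter_sketch kmem = { .usage = 0, .max = 4 };
	int ret;

	ret = sketch_kmem_charge(&kmem, 4, false);
	assert(ret == 0);
	printf("within limit:  ret=%d usage=%ld\n", ret, kmem.usage);

	ret = sketch_kmem_charge(&kmem, 1, false);
	printf("over limit:    ret=%d usage=%ld\n", ret, kmem.usage);

	ret = sketch_kmem_charge(&kmem, 1, true);
	printf("nofail forced: ret=%d usage=%ld\n", ret, kmem.usage);
}
```

In this model a regular request that would cross the limit is rejected
and leaves the usage unchanged, while a nofail request succeeds and
pushes the usage past the limit, which is the trade-off the changelog
describes.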




end of thread, other threads:[~2019-09-24 23:06 UTC | newest]

Thread overview: 26+ messages
2019-09-01 20:43 [BUG] Early OOM and kernel NULL pointer dereference in 4.19.69 Thomas Lindroth
2019-09-02  7:16 ` Michal Hocko
2019-09-02  7:27   ` Michal Hocko
2019-09-02 19:34   ` Thomas Lindroth
2019-09-03  7:41     ` Michal Hocko
2019-09-03 12:01       ` Thomas Lindroth
2019-09-03 12:05       ` Andrey Ryabinin
2019-09-03 12:22         ` Michal Hocko
2019-09-03 18:20           ` Thomas Lindroth
2019-09-03 19:36             ` Michal Hocko
     [not found] ` <666dbcde-1b8a-9e2d-7d1f-48a117c78ae1@I-love.SAKURA.ne.jp>
2019-09-03 18:25   ` Thomas Lindroth
     [not found]     ` <4d0eda9a-319d-1a7d-1eed-71da90902367@i-love.sakura.ne.jp>
2019-09-04 11:25       ` [BUG] kmemcg limit defeats __GFP_NOFAIL allocation Michal Hocko
     [not found]         ` <4d87d770-c110-224f-6c0c-d6fada90417d@i-love.sakura.ne.jp>
2019-09-04 11:59           ` Michal Hocko
     [not found]         ` <0056063b-46ff-0ebd-ff0d-c96a1f9ae6b1@i-love.sakura.ne.jp>
2019-09-04 14:29           ` Michal Hocko
     [not found]             ` <405ce28b-c0b4-780c-c883-42d741ec60e0@i-love.sakura.ne.jp>
2019-09-05 23:11               ` Thomas Lindroth
2019-09-06  7:27                 ` Michal Hocko
2019-09-06 10:54                   ` Andrey Ryabinin
2019-09-06 11:29                     ` Michal Hocko
     [not found] ` <20190906125608.32129-1-mhocko@kernel.org>
2019-09-06 18:24   ` [PATCH] memcg, kmem: do not fail __GFP_NOFAIL charges Shakeel Butt
2019-09-09 11:22     ` Michal Hocko
2019-09-11 12:00       ` Michal Hocko
2019-09-11 14:37         ` Andrew Morton
2019-09-11 15:16           ` Michal Hocko
2019-09-13  2:46             ` Shakeel Butt
2019-09-24 10:53   ` Michal Hocko
2019-09-24 23:06     ` Andrew Morton
