* More OOM problems
@ 2016-09-18 20:03 Linus Torvalds
  2016-09-18 20:26 ` Lorenzo Stoakes
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Linus Torvalds @ 2016-09-18 20:03 UTC (permalink / raw)
  To: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

[ More or less random collection of people from previous oom patches
and/or discussions, if you feel you shouldn't have been cc'd, blame me
for just picking things from earlier threads and/or commits ]

I'm afraid that the oom situation is still not fixed, and the "let's
die quickly" patches are still a nasty regression.

I have a 16GB desktop that I just noticed killed one of the chrome
tabs yesterday. The machine had *tons* of freeable memory, with
something like 7GB of page cache at the time, if I read this right.

The trigger is a kcalloc() in the i915 driver:

    Xorg invoked oom-killer:
gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3,
oom_score_adj=0

      __kmalloc+0x1cd/0x1f0
      alloc_gen8_temp_bitmaps+0x47/0x80 [i915]

which looks like it is one of these:

  slabinfo - version: 2.1
  # name            <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab>
  kmalloc-8192         268    268   8192    4    8
  kmalloc-4096         732    786   4096    8    8
  kmalloc-2048        1402   1456   2048   16    8
  kmalloc-1024        2505   2976   1024   32    8

so even just a 1kB allocation can cause an order-3 page allocation.
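
The order can be read straight off the slabinfo columns: objsize times
objperslab fills exactly pagesperslab pages, and pagesperslab is a power
of two, so the buddy order is its log2. A quick sketch (assuming 4kB
pages; the helper loop is mine, not kernel code):

```python
import math

PAGE_SIZE = 4096  # assuming 4kB pages (the x86-64 default)

# (name, objsize, objperslab, pagesperslab) copied from the slabinfo above
slabs = [
    ("kmalloc-8192", 8192, 4, 8),
    ("kmalloc-4096", 4096, 8, 8),
    ("kmalloc-2048", 2048, 16, 8),
    ("kmalloc-1024", 1024, 32, 8),
]

for name, objsize, objperslab, pagesperslab in slabs:
    # sanity check: the objects exactly fill the slab's pages
    assert objsize * objperslab == pagesperslab * PAGE_SIZE
    order = int(math.log2(pagesperslab))  # buddy order of one new slab
    print(f"{name}: {pagesperslab} pages per slab -> order-{order}")
```

So refilling any of these caches asks the page allocator for a
contiguous 32kB block, no matter how small the object being allocated.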

And yeah, I had what, 137MB of free memory, it's just that it's all
fairly fragmented. There are actually even order-4 pages, but they are
in low DMA memory and the system tries to protect them:

  Node 0 DMA: 0*4kB 1*8kB (U) 2*16kB (U) 1*32kB (U) 3*64kB (U) 2*128kB
(U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
  Node 0 DMA32: 11110*4kB (UMEH) 2929*8kB (UMEH) 44*16kB (MH) 1*32kB
(H) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
68608kB
  Node 0 Normal: 14031*4kB (UMEH) 49*8kB (UMEH) 18*16kB (UH) 0*32kB
0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56804kB
  Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=1048576kB
  Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
hugepages_size=2048kB
  2084682 total pagecache pages
  11 pages in swap cache
  Swap cache stats: add 35, delete 24, find 2/3
  Free swap  = 8191868kB
  Total swap = 8191996kB
  4168499 pages RAM
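
The per-order free lists above can be cross-checked, and they make the
fragmentation point concrete: plenty of 4kB and 8kB blocks, almost
nothing contiguous at order 3 (32kB) or above outside the protected DMA
zone. A sketch (counts copied from the dump; the helper is mine):

```python
def free_kb(counts, base_kb=4):
    """Sum a buddy free-list histogram: counts[i] blocks of base_kb * 2**i kB."""
    return sum(n * base_kb * (1 << i) for i, n in enumerate(counts))

# per-order block counts (order 0..10) from the dump above
dma32  = [11110, 2929, 44, 1, 0, 0, 0, 0, 0, 0, 0]
normal = [14031, 49, 18, 0, 0, 0, 0, 0, 0, 0, 0]

print(free_kb(dma32), free_kb(normal))    # 68608 56804, matching the dump
# blocks usable as-is for an order-3 request (>= 32kB contiguous):
print(sum(dma32[3:]), sum(normal[3:]))    # 1 0
```

Tens of thousands of free pages, but exactly one block in DMA32 (and
none in Normal) could have satisfied the order-3 request directly.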

And it looks like there's a fair amount of memory busy under writeback
(470MB or so):

  active_anon:1539159 inactive_anon:374915 isolated_anon:0
                            active_file:1251771 inactive_file:450068
isolated_file:0
                            unevictable:175 dirty:26 writeback:118690 unstable:0
                            slab_reclaimable:220784 slab_unreclaimable:39819
                            mapped:491617 shmem:382891 pagetables:20439 bounce:0
                            free:35301 free_pcp:895 free_cma:0
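
The figures quoted in the mail line up with these counters, assuming
4kB pages (a quick conversion; the helper is mine):

```python
def pages_to_mb(pages, page_kb=4):
    """Convert a page count to MB, assuming 4kB pages."""
    return pages * page_kb / 1024

print(round(pages_to_mb(118690)))            # writeback: ~464 MB ("470MB or so")
print(round(pages_to_mb(35301)))             # free: ~138 MB ("137MB free")
print(round(pages_to_mb(1251771 + 450068)))  # file pages: ~6648 MB (~7GB of cache)
```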

And yes, CONFIG_COMPACTION was enabled.

So quite honestly, I *really* don't think that a 1kB allocation should
have reasonably failed and killed anything at all (ok, it could have
been an 8kB one, who knows - but it really looks like it *could* have
been just 1kB).

Considering that kmalloc() pattern, I suspect that we need to consider
order-3 allocations "small", and try a lot harder.

Because killing processes due to "out of memory" in this situation is
unquestionably a bug.

And no, I can't recreate this, obviously.

I think there's a series in -mm that hasn't been merged and that is
pending (presumably for 4.9). I think Arkadiusz tested it for his
(repeatable) workload. It may need to be considered for 4.8, because
the above is ridiculously bad, imho.

Andrew? Vlastimil? Michal? Others?

            Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: More OOM problems
  2016-09-18 20:03 More OOM problems Linus Torvalds
@ 2016-09-18 20:26 ` Lorenzo Stoakes
  2016-09-18 20:58   ` Linus Torvalds
  2016-09-19  8:32   ` Michal Hocko
  2016-09-18 21:00 ` Vlastimil Babka
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Lorenzo Stoakes @ 2016-09-18 20:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

Hi all,

In case it's helpful: I have experienced these OOM issues too, invoked
in my case via the nvidia driver, and similarly to Linus an order-3
allocation resulted in killed chromium tabs. I encountered this even
after applying the patch discussed in the original thread at
https://lkml.org/lkml/2016/8/22/184. It's not easily reproducible, but
it happens often enough that I could probably check some specific state
when it next occurs, or test a patch to see if it stops it, if that'd
be useful.

I saved a couple of OOM reports from the last time it occurred; this is
on an 8GiB system with plenty of reclaimable memory:
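
For a sense of scale, the first Mem-Info dump below shows how much of
the 8GiB was nominally reclaimable, assuming 4kB pages (a quick
back-of-envelope; the helper is mine):

```python
def pages_to_mb(pages, page_kb=4):
    """Convert a page count to MB, assuming 4kB pages."""
    return pages * page_kb / 1024

slab_reclaimable = 697630        # from the first Mem-Info dump below
file_pages = 146274 + 144248     # active_file + inactive_file

print(round(pages_to_mb(slab_reclaimable)))              # ~2725 MB reclaimable slab
print(round(pages_to_mb(slab_reclaimable + file_pages))) # ~3860 MB total
```

Roughly half the machine was file cache or reclaimable slab at the time
of the kill.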

[350085.038693] Xorg invoked oom-killer: gfp_mask=0x24040c0(GFP_KERNEL|__GFP_COMP), order=3, oom_score_adj=0
[350085.038696] Xorg cpuset=/ mems_allowed=0
[350085.038699] CPU: 0 PID: 2119 Comm: Xorg Tainted: P           O    4.7.2-1-custom #1
[350085.038701] Hardware name: MSI MS-7850/Z97 PC Mate(MS-7850), BIOS V4.10 08/11/2015
[350085.038702]  0000000000000286 000000009fd6569c ffff88020c60f940 ffffffff812eb122
[350085.038704]  ffff88020c60fb18 ffff8800b4cfaac0 ffff88020c60f9b0 ffffffff811f6e4c
[350085.038706]  0000000000000246 ffff880200000000 ffff88020c60f970 ffffffff00000001
[350085.038708] Call Trace:
[350085.038712]  [<ffffffff812eb122>] dump_stack+0x63/0x81
[350085.038715]  [<ffffffff811f6e4c>] dump_header+0x60/0x1e8
[350085.038718]  [<ffffffff811762fa>] oom_kill_process+0x22a/0x440
[350085.038720]  [<ffffffff8117696a>] out_of_memory+0x40a/0x4b0
[350085.038723]  [<ffffffff812ffdf8>] ? find_next_bit+0x18/0x20
[350085.038725]  [<ffffffff8117c034>] __alloc_pages_nodemask+0xee4/0xf20
[350085.038727]  [<ffffffff811cb835>] alloc_pages_current+0x95/0x140
[350085.038729]  [<ffffffff8117c2f9>] alloc_kmem_pages+0x19/0x90
[350085.038731]  [<ffffffff8119a79e>] kmalloc_order_trace+0x2e/0x100
[350085.038733]  [<ffffffff811d6bd3>] __kmalloc+0x213/0x230
[350085.038745]  [<ffffffffa147d2c7>] nvkms_alloc+0x27/0x60 [nvidia_modeset]
[350085.038752]  [<ffffffffa147e540>] ? _nv000318kms+0x40/0x40 [nvidia_modeset]
[350085.038760]  [<ffffffffa14b7eea>] _nv001929kms+0x1a/0x30 [nvidia_modeset]
[350085.038767]  [<ffffffffa14a4242>] ? _nv001878kms+0x32/0xcf0 [nvidia_modeset]
[350085.038768]  [<ffffffff8117c2f9>] ? alloc_kmem_pages+0x19/0x90
[350085.038770]  [<ffffffff811d6bd3>] ? __kmalloc+0x213/0x230
[350085.038776]  [<ffffffffa147d2c7>] ? nvkms_alloc+0x27/0x60 [nvidia_modeset]
[350085.038782]  [<ffffffffa147e540>] ? _nv000318kms+0x40/0x40 [nvidia_modeset]
[350085.038788]  [<ffffffffa147e56e>] ? _nv000169kms+0x2e/0x40 [nvidia_modeset]
[350085.038794]  [<ffffffffa147f0c1>] ? nvKmsIoctl+0x161/0x1e0 [nvidia_modeset]
[350085.038800]  [<ffffffffa147dd65>] ? nvkms_ioctl_common+0x45/0x80 [nvidia_modeset]
[350085.038806]  [<ffffffffa147de11>] ? nvkms_ioctl+0x71/0xa0 [nvidia_modeset]
[350085.038962]  [<ffffffffa0831080>] ? nvidia_frontend_compat_ioctl+0x40/0x50 [nvidia]
[350085.039032]  [<ffffffffa083109e>] ? nvidia_frontend_unlocked_ioctl+0xe/0x10 [nvidia]
[350085.039035]  [<ffffffff8120cd62>] ? do_vfs_ioctl+0xa2/0x5d0
[350085.039037]  [<ffffffff8120d309>] ? SyS_ioctl+0x79/0x90
[350085.039039]  [<ffffffff815de7b2>] ? entry_SYSCALL_64_fastpath+0x1a/0xa4
[350085.039048] Mem-Info:
[350085.039051] active_anon:861397 inactive_anon:23397 isolated_anon:0
                 active_file:146274 inactive_file:144248 isolated_file:0
                 unevictable:8 dirty:14587 writeback:0 unstable:0
                 slab_reclaimable:697630 slab_unreclaimable:24397
                 mapped:79655 shmem:26548 pagetables:7211 bounce:0
                 free:25159 free_pcp:235 free_cma:0
[350085.039054] Node 0 DMA free:15516kB min:136kB low:168kB high:200kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[350085.039058] lowmem_reserve[]: 0 3196 7658 7658
[350085.039060] Node 0 DMA32 free:45980kB min:28148kB low:35184kB high:42220kB active_anon:1466208kB inactive_anon:43120kB active_file:239740kB inactive_file:234920kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3617864kB managed:3280092kB mlocked:0kB dirty:21692kB writeback:0kB mapped:131184kB shmem:47588kB slab_reclaimable:1147984kB slab_unreclaimable:37484kB kernel_stack:2976kB pagetables:11512kB unstable:0kB bounce:0kB free_pcp:188kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[350085.039064] lowmem_reserve[]: 0 0 4462 4462
[350085.039065] Node 0 Normal free:39140kB min:39296kB low:49120kB high:58944kB active_anon:1979380kB inactive_anon:50468kB active_file:345356kB inactive_file:342072kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:4702208kB managed:4569312kB mlocked:32kB dirty:36656kB writeback:0kB mapped:187436kB shmem:58604kB slab_reclaimable:1642536kB slab_unreclaimable:60104kB kernel_stack:5040kB pagetables:17332kB unstable:0kB bounce:0kB free_pcp:752kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:136 all_unreclaimable? no
[350085.039069] lowmem_reserve[]: 0 0 0 0
[350085.039071] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15516kB
[350085.039077] Node 0 DMA32: 11569*4kB (UME) 50*8kB (M) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46708kB
[350085.039083] Node 0 Normal: 9282*4kB (UE) 0*8kB 4*16kB (H) 2*32kB (H) 3*64kB (H) 1*128kB (H) 2*256kB (H) 2*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 39112kB
[350085.039090] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[350085.039092] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[350085.039092] 316873 total pagecache pages
[350085.039093] 0 pages in swap cache
[350085.039094] Swap cache stats: add 0, delete 0, find 0/0
[350085.039095] Free swap  = 0kB
[350085.039096] Total swap = 0kB
[350085.039097] 2084014 pages RAM
[350085.039097] 0 pages HighMem/MovableOnly
[350085.039098] 117688 pages reserved
[350085.039099] 0 pages hwpoisoned
[350085.039099] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[350085.039107] [  208]     0   208    19325     3639      35       3        0             0 systemd-journal
[350085.039109] [  265]     0   265     9102      818      18       3        0         -1000 systemd-udevd
[350085.039113] [ 1809]   192  1809    28539      542      25       3        0             0 systemd-timesyn
[350085.039115] [ 1822]    81  1822     8296      653      21       4        0          -900 dbus-daemon
[350085.039117] [ 1832]     0  1832     9629      511      22       3        0             0 systemd-logind
[350085.039119] [ 1862]     0  1862     3260      611      10       3        0             0 crond
[350085.039121] [ 1910]  1000  1910     3505      776      11       3        0             0 devmon
[350085.039122] [ 2041]  1000  2041     6453      707      18       3        0             0 udevil
[350085.039124] [ 2050]     0  2050     1685       77       9       3        0             0 dhcpcd
[350085.039126] [ 2051]     0  2051   132774     4634      65       6        0          -500 dockerd
[350085.039128] [ 2057]     0  2057    10099      746      25       3        0         -1000 sshd
[350085.039130] [ 2085]     0  2085   108773     1180      30       5        0          -500 docker-containe
[350085.039132] [ 2100]     0  2100    66532      937      32       3        0             0 lightdm
[350085.039134] [ 2119]     0  2119    50761    17423      97       3        0             0 Xorg
[350085.039136] [ 2123]     0  2123    68666     1431      36       3        0             0 accounts-daemon
[350085.039138] [ 2135]   102  2135   129825     2707      50       4        0             0 polkitd
[350085.039142] [ 2562]     0  2562    65038     1104      59       3        0             0 lightdm
[350085.039144] [ 2572]  1000  2572    13695     1020      29       3        0             0 systemd
[350085.039146] [ 2577]  1000  2577    24641      397      48       3        0             0 (sd-pam)
[350085.039147] [ 2584]  1000  2584    32170     1929      66       3        0             0 i3
[350085.039149] [ 2596]  1000  2596     2788      184      10       3        0             0 ssh-agent
[350085.039151] [ 2603]  1000  2603     8212      576      20       3        0             0 dbus-daemon
[350085.039153] [ 2605]  1000  2605    27268     1497      56       3        0             0 i3bar
[350085.039155] [ 2606]  1000  2606     3406      459      10       3        0             0 measure-net-spe
[350085.039157] [ 2607]  1000  2607    17782      505      40       3        0             0 i3status
[350085.039159] [ 2608]  1000  2608     3406      585      10       3        0             0 measure-net-spe
[350085.039161] [ 2658]  1000  2658    67811      849      34       3        0             0 gvfsd
[350085.039163] [ 2663]  1000  2663    84725     1246      31       3        0             0 gvfsd-fuse
[350085.039164] [ 2671]  1000  2671    84401      651      33       3        0             0 at-spi-bus-laun
[350085.039166] [ 2676]  1000  2676     8186      714      20       3        0             0 dbus-daemon
[350085.039168] [ 2678]  1000  2678    53563      667      40       3        0             0 at-spi2-registr
[350085.039169] [ 2682]  1000  2682    14718      829      32       3        0             0 gconfd-2
[350085.039171] [ 2690]  1000  2690   222566     5206     122       4        0             0 pulseaudio
[350085.039173] [ 2691]   133  2691    44462      530      22       3        0             0 rtkit-daemon
[350085.039174] [ 2743]  1000  2743     4649      755      13       3        0             0 zsh
[350085.039176] [ 2748]  1000  2748   437286    84044     478       6        0             0 chromium
[350085.039178] [ 2752]  1000  2752     1585      191       9       3        0             0 chrome-sandbox
[350085.039179] [ 2753]  1000  2753   113527     5589     166       4        0             0 chromium
[350085.039181] [ 2756]  1000  2756     1585      177       8       3        0             0 chrome-sandbox
[350085.039182] [ 2757]  1000  2757     7909      840      22       4        0             0 nacl_helper
[350085.039184] [ 2759]  1000  2759   113527     2847     127       4        0             0 chromium
[350085.039186] [ 2866]  1000  2866   340267   219730     629       7        0           200 chromium
[350085.039187] [ 2881]  1000  2881   114831     4858     144       5        0           200 chromium
[350085.039189] [ 2891]  1000  2891   258525    43032     338      68        0           300 chromium
[350085.039191] [ 2908]  1000  2908   216776    17487     220      31        0           300 chromium
[350085.039193] [ 3096]     0  3096    73383     1417      42       3        0             0 upowerd
[350085.039194] [ 4273]  1000  4273     4649      761      13       3        0             0 zsh
[350085.039196] [ 4276]  1000  4276   206798     6849     144       4        0             0 pavucontrol
[350085.039198] [ 6647]  1000  6647   250470    37756     295      54        0           300 chromium
[350085.039200] [ 6658]  1000  6658   214211    17257     215      29        0           300 chromium
[350085.039201] [ 7390]  1000  7390   216243    17154     217      29        0           300 chromium
[350085.039204] [23007]  1000 23007   113232     2020      54       4        0             0 gvfs-udisks2-vo
[350085.039205] [23010]     0 23010    91532     2142      44       3        0             0 udisksd
[350085.039207] [ 6558]  1000  6558    20485     2858      42       3        0             0 urxvt
[350085.039209] [ 6559]  1000  6559     9121     1722      22       3        0             0 zsh
[350085.039210] [ 6581]  1000  6581    39165    25124      80       4        0             0 mutt
[350085.039213] [18246]  1000 18246     4649      848      12       3        0             0 zsh
[350085.039215] [18251]  1000 18251   191866    14934     175       4        0             0 emacs
[350085.039216] [18256]  1000 18256     4004      813      12       3        0             0 bash
[350085.039218] [18261]  1000 18261    20305     2924      43       3        0             0 urxvt
[350085.039220] [18262]  1000 18262     9121     1714      23       3        0             0 zsh
[350085.039223] [ 7362]  1000  7362   319274   102294     527     164        0           300 chromium
[350085.039225] [ 9185]  1000  9185   400186   164602     672     161        0           300 chromium
[350085.039227] [10839]  1000 10839   253464    41492     303      50        0           300 chromium
[350085.039228] [10957]     0 10957    17509     1231      37       3        0             0 sudo
[350085.039230] [10960]     0 10960    55798    21075      81       4        0             0 pacman
[350085.039232] [15262]     0 15262     3438      787      11       3        0             0 alpm-hook
[350085.039234] [15263]     0 15263     3868     1244      11       3        0             0 dkms
[350085.039236] [15278]     0 15278     3869     1168      11       3        0             0 dkms
[350085.039237] [15611]     0 15611     3869      916      11       3        0             0 dkms
[350085.039239] [15612]     0 15612     3869      947      11       3        0             0 dkms
[350085.039241] [15613]     0 15613     8562      865      20       3        0             0 make
[350085.039242] [15619]     0 15619     8793     1078      20       3        0             0 make
[350085.039244] [15889]     0 15889     9148     1498      22       3        0             0 make
[350085.039246] [18079]     0 18079     3442      779      11       3        0             0 sh
[350085.039248] [18080]     0 18080     2490      227       9       3        0             0 cc
[350085.039249] [18081]     0 18081    68687    38388      97       3        0             0 cc1
[350085.039251] [18082]     0 18082     4786     1977      14       3        0             0 as
[350085.039253] [18091]     0 18091     3442      808      11       3        0             0 sh
[350085.039255] [18093]     0 18093     1454      165       8       3        0             0 sleep
[350085.039257] [18094]     0 18094     2490      253       9       3        0             0 cc
[350085.039259] [18095]     0 18095    68650    38238      96       3        0             0 cc1
[350085.039261] [18101]     0 18101     4786     1964      14       3        0             0 as
[350085.039263] [18104]     0 18104     3442      814      11       3        0             0 sh
[350085.039264] [18106]     0 18106     2490      248       9       3        0             0 cc
[350085.039266] [18107]     0 18107    67906    36050      93       3        0             0 cc1
[350085.039268] [18108]     0 18108     4786     2030      14       3        0             0 as
[350085.039270] [18130]     0 18130     3442      790      12       3        0             0 sh
[350085.039271] [18133]     0 18133     2490      235       8       3        0             0 cc
[350085.039273] [18134]     0 18134     3442      781      12       3        0             0 sh
[350085.039275] [18135]     0 18135    67911    36623      95       3        0             0 cc1
[350085.039277] [18136]     0 18136     4786     1935      15       3        0             0 as
[350085.039278] [18137]     0 18137     3442      786      10       3        0             0 sh
[350085.039280] [18138]     0 18138     2490      229       9       3        0             0 cc
[350085.039282] [18139]     0 18139     2490      242       9       3        0             0 cc
[350085.039284] [18140]     0 18140    67922    20214      63       3        0             0 cc1
[350085.039286] [18141]     0 18141    66967    36993      94       3        0             0 cc1
[350085.039288] [18142]     0 18142     4786     1952      14       4        0             0 as
[350085.039289] [18143]     0 18143     4786     2012      13       3        0             0 as
[350085.039291] [18152]     0 18152     3442      778      10       3        0             0 sh
[350085.039293] [18153]     0 18153     2490      226       9       3        0             0 cc
[350085.039295] [18154]     0 18154    22881    13677      47       3        0             0 cc1
[350085.039296] [18155]     0 18155     4786     2012      15       3        0             0 as
[350085.039298] [18166]     0 18166     3442      809      10       3        0             0 sh
[350085.039300] [18167]     0 18167     3442      137       8       3        0             0 sh
[350085.039301] Out of memory: Kill process 9185 (chromium) score 384 or sacrifice child
[350085.039346] Killed process 9185 (chromium) total-vm:1600744kB, anon-rss:548240kB, file-rss:71988kB, shmem-rss:38180kB
[350085.075980] oom_reaper: reaped process 9185 (chromium), now anon-rss:0kB, file-rss:0kB, shmem-rss:38480kB
[350086.337625] Xorg invoked oom-killer: gfp_mask=0x24040c0(GFP_KERNEL|__GFP_COMP), order=3, oom_score_adj=0
[350086.337628] Xorg cpuset=/ mems_allowed=0
[350086.337633] CPU: 0 PID: 2119 Comm: Xorg Tainted: P           O    4.7.2-1-custom #1
[350086.337634] Hardware name: MSI MS-7850/Z97 PC Mate(MS-7850), BIOS V4.10 08/11/2015
[350086.337635]  0000000000000286 000000009fd6569c ffff88020c60f940 ffffffff812eb122
[350086.337637]  ffff88020c60fb18 ffff8800cb5ae3c0 ffff88020c60f9b0 ffffffff811f6e4c
[350086.337639]  0000000000000246 ffff880200000000 ffff88020c60f970 ffffffff00000002
[350086.337640] Call Trace:
[350086.337646]  [<ffffffff812eb122>] dump_stack+0x63/0x81
[350086.337649]  [<ffffffff811f6e4c>] dump_header+0x60/0x1e8
[350086.337653]  [<ffffffff811762fa>] oom_kill_process+0x22a/0x440
[350086.337655]  [<ffffffff8117696a>] out_of_memory+0x40a/0x4b0
[350086.337657]  [<ffffffff812ffdf8>] ? find_next_bit+0x18/0x20
[350086.337659]  [<ffffffff8117c034>] __alloc_pages_nodemask+0xee4/0xf20
[350086.337662]  [<ffffffff811cb835>] alloc_pages_current+0x95/0x140
[350086.337663]  [<ffffffff8117c2f9>] alloc_kmem_pages+0x19/0x90
[350086.337666]  [<ffffffff8119a79e>] kmalloc_order_trace+0x2e/0x100
[350086.337668]  [<ffffffff811d6bd3>] __kmalloc+0x213/0x230
[350086.337681]  [<ffffffffa147d2c7>] nvkms_alloc+0x27/0x60 [nvidia_modeset]
[350086.337687]  [<ffffffffa147e540>] ? _nv000318kms+0x40/0x40 [nvidia_modeset]
[350086.337695]  [<ffffffffa14b7eea>] _nv001929kms+0x1a/0x30 [nvidia_modeset]
[350086.337702]  [<ffffffffa14a4242>] ? _nv001878kms+0x32/0xcf0 [nvidia_modeset]
[350086.337703]  [<ffffffff8117c2f9>] ? alloc_kmem_pages+0x19/0x90
[350086.337705]  [<ffffffff811d6bd3>] ? __kmalloc+0x213/0x230
[350086.337711]  [<ffffffffa147d2c7>] ? nvkms_alloc+0x27/0x60 [nvidia_modeset]
[350086.337716]  [<ffffffffa147e540>] ? _nv000318kms+0x40/0x40 [nvidia_modeset]
[350086.337722]  [<ffffffffa147e56e>] ? _nv000169kms+0x2e/0x40 [nvidia_modeset]
[350086.337728]  [<ffffffffa147f0c1>] ? nvKmsIoctl+0x161/0x1e0 [nvidia_modeset]
[350086.337734]  [<ffffffffa147dd65>] ? nvkms_ioctl_common+0x45/0x80 [nvidia_modeset]
[350086.337740]  [<ffffffffa147de11>] ? nvkms_ioctl+0x71/0xa0 [nvidia_modeset]
[350086.337838]  [<ffffffffa0831080>] ? nvidia_frontend_compat_ioctl+0x40/0x50 [nvidia]
[350086.337911]  [<ffffffffa083109e>] ? nvidia_frontend_unlocked_ioctl+0xe/0x10 [nvidia]
[350086.337915]  [<ffffffff8120cd62>] ? do_vfs_ioctl+0xa2/0x5d0
[350086.337917]  [<ffffffff8120d309>] ? SyS_ioctl+0x79/0x90
[350086.337920]  [<ffffffff815de7b2>] ? entry_SYSCALL_64_fastpath+0x1a/0xa4
[350086.337933] Mem-Info:
[350086.337936] active_anon:926090 inactive_anon:14054 isolated_anon:0
                 active_file:127217 inactive_file:124640 isolated_file:0
                 unevictable:8 dirty:14757 writeback:0 unstable:0
                 slab_reclaimable:685505 slab_unreclaimable:20594
                 mapped:69794 shmem:17206 pagetables:7032 bounce:0
                 free:25275 free_pcp:114 free_cma:0
[350086.337939] Node 0 DMA free:15516kB min:136kB low:168kB high:200kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[350086.337944] lowmem_reserve[]: 0 3196 7658 7658
[350086.337946] Node 0 DMA32 free:46168kB min:28148kB low:35184kB high:42220kB active_anon:1571968kB inactive_anon:33316kB active_file:206232kB inactive_file:198884kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3617864kB managed:3280092kB mlocked:0kB dirty:21952kB writeback:0kB mapped:120868kB shmem:37784kB slab_reclaimable:1128300kB slab_unreclaimable:31216kB kernel_stack:2848kB pagetables:11300kB unstable:0kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:92 all_unreclaimable? no
[350086.337950] lowmem_reserve[]: 0 0 4462 4462
[350086.337952] Node 0 Normal free:39416kB min:39296kB low:49120kB high:58944kB active_anon:2132392kB inactive_anon:22900kB active_file:302636kB inactive_file:299676kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:4702208kB managed:4569312kB mlocked:32kB dirty:37076kB writeback:0kB mapped:158308kB shmem:31040kB slab_reclaimable:1613720kB slab_unreclaimable:51160kB kernel_stack:4752kB pagetables:16828kB unstable:0kB bounce:0kB free_pcp:440kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:400 all_unreclaimable? no
[350086.337956] lowmem_reserve[]: 0 0 0 0
[350086.337958] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15516kB
[350086.337984] Node 0 DMA32: 11350*4kB (UME) 172*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46776kB
[350086.337989] Node 0 Normal: 9232*4kB (UME) 54*8kB (ME) 62*16kB (MEH) 2*32kB (H) 3*64kB (H) 1*128kB (H) 2*256kB (H) 2*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 40272kB
[350086.337997] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[350086.337998] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[350086.337999] 269040 total pagecache pages
[350086.338011] 0 pages in swap cache
[350086.338012] Swap cache stats: add 0, delete 0, find 0/0
[350086.338013] Free swap  = 0kB
[350086.338013] Total swap = 0kB
[350086.338014] 2084014 pages RAM
[350086.338015] 0 pages HighMem/MovableOnly
[350086.338016] 117688 pages reserved
[350086.338016] 0 pages hwpoisoned
[350086.338017] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[350086.338027] [  208]     0   208    19325     3815      35       3        0             0 systemd-journal
[350086.338029] [  265]     0   265     9102      818      18       3        0         -1000 systemd-udevd
[350086.338033] [ 1809]   192  1809    28539      542      25       3        0             0 systemd-timesyn
[350086.338035] [ 1822]    81  1822     8296      653      21       4        0          -900 dbus-daemon
[350086.338037] [ 1832]     0  1832     9629      511      22       3        0             0 systemd-logind
[350086.338039] [ 1862]     0  1862     3260      611      10       3        0             0 crond
[350086.338041] [ 1910]  1000  1910     3505      776      11       3        0             0 devmon
[350086.338043] [ 2041]  1000  2041     6453      707      18       3        0             0 udevil
[350086.338045] [ 2050]     0  2050     1685       77       9       3        0             0 dhcpcd
[350086.338047] [ 2051]     0  2051   132774     4634      65       6        0          -500 dockerd
[350086.338049] [ 2057]     0  2057    10099      746      25       3        0         -1000 sshd
[350086.338051] [ 2085]     0  2085   108773     1180      30       5        0          -500 docker-containe
[350086.338053] [ 2100]     0  2100    66532      937      32       3        0             0 lightdm
[350086.338055] [ 2119]     0  2119    50761    17520      97       3        0             0 Xorg
[350086.338057] [ 2123]     0  2123    68666     1431      36       3        0             0 accounts-daemon
[350086.338058] [ 2135]   102  2135   129825     2705      50       4        0             0 polkitd
[350086.338062] [ 2562]     0  2562    65038     1104      59       3        0             0 lightdm
[350086.338064] [ 2572]  1000  2572    13695     1020      29       3        0             0 systemd
[350086.338066] [ 2577]  1000  2577    24641      397      48       3        0             0 (sd-pam)
[350086.338069] [ 2584]  1000  2584    32170     1929      66       3        0             0 i3
[350086.338070] [ 2596]  1000  2596     2788      184      10       3        0             0 ssh-agent
[350086.338073] [ 2603]  1000  2603     8212      576      20       3        0             0 dbus-daemon
[350086.338075] [ 2605]  1000  2605    27268     1497      56       3        0             0 i3bar
[350086.338077] [ 2606]  1000  2606     3406      459      10       3        0             0 measure-net-spe
[350086.338079] [ 2607]  1000  2607    17782      505      40       3        0             0 i3status
[350086.338081] [ 2608]  1000  2608     3406      585      10       3        0             0 measure-net-spe
[350086.338084] [ 2658]  1000  2658    67811      849      34       3        0             0 gvfsd
[350086.338086] [ 2663]  1000  2663    84725     1246      31       3        0             0 gvfsd-fuse
[350086.338088] [ 2671]  1000  2671    84401      651      33       3        0             0 at-spi-bus-laun
[350086.338091] [ 2676]  1000  2676     8186      714      20       3        0             0 dbus-daemon
[350086.338093] [ 2678]  1000  2678    53563      667      40       3        0             0 at-spi2-registr
[350086.338095] [ 2682]  1000  2682    14718      829      32       3        0             0 gconfd-2
[350086.338098] [ 2690]  1000  2690   222566     5206     122       4        0             0 pulseaudio
[350086.338100] [ 2691]   133  2691    44462      530      22       3        0             0 rtkit-daemon
[350086.338103] [ 2743]  1000  2743     4649      755      13       3        0             0 zsh
[350086.338106] [ 2748]  1000  2748   433311    84260     475       6        0             0 chromium
[350086.338108] [ 2752]  1000  2752     1585      191       9       3        0             0 chrome-sandbox
[350086.338110] [ 2753]  1000  2753   113527     5589     166       4        0             0 chromium
[350086.338112] [ 2756]  1000  2756     1585      177       8       3        0             0 chrome-sandbox
[350086.338114] [ 2757]  1000  2757     7909      840      22       4        0             0 nacl_helper
[350086.338117] [ 2759]  1000  2759   113527     2847     127       4        0             0 chromium
[350086.338120] [ 2866]  1000  2866   332680   213705     612       7        0           200 chromium
[350086.338122] [ 2881]  1000  2881   114831     4858     144       5        0           200 chromium
[350086.338124] [ 2891]  1000  2891   258525    43032     338      68        0           300 chromium
[350086.338126] [ 2908]  1000  2908   216776    17487     220      31        0           300 chromium
[350086.338129] [ 3096]     0  3096    73383     1417      42       3        0             0 upowerd
[350086.338131] [ 4273]  1000  4273     4649      761      13       3        0             0 zsh
[350086.338134] [ 4276]  1000  4276   206984     7921     144       4        0             0 pavucontrol
[350086.338136] [ 6647]  1000  6647   250470    37756     295      54        0           300 chromium
[350086.338138] [ 6658]  1000  6658   214211    17257     215      29        0           300 chromium
[350086.338140] [ 7390]  1000  7390   216243    17154     217      29        0           300 chromium
[350086.338143] [23007]  1000 23007   113232     2020      54       4        0             0 gvfs-udisks2-vo
[350086.338145] [23010]     0 23010    91532     2140      44       3        0             0 udisksd
[350086.338147] [ 6558]  1000  6558    20485     2858      42       3        0             0 urxvt
[350086.338150] [ 6559]  1000  6559     9121     1722      22       3        0             0 zsh
[350086.338152] [ 6581]  1000  6581    39165    25124      80       4        0             0 mutt
[350086.338155] [18246]  1000 18246     4649      848      12       3        0             0 zsh
[350086.338157] [18251]  1000 18251   191866    14934     175       4        0             0 emacs
[350086.338159] [18256]  1000 18256     4004      813      12       3        0             0 bash
[350086.338161] [18261]  1000 18261    20305     2924      43       3        0             0 urxvt
[350086.338163] [18262]  1000 18262     9121     1714      23       3        0             0 zsh
[350086.338168] [ 7362]  1000  7362   319274   102294     527     164        0           300 chromium
[350086.338171] [10839]  1000 10839   253464    41492     303      50        0           300 chromium
[350086.338173] [10957]     0 10957    17509     1231      37       3        0             0 sudo
[350086.338175] [10960]     0 10960    55798    21075      81       4        0             0 pacman
[350086.338178] [15262]     0 15262     3438      787      11       3        0             0 alpm-hook
[350086.338180] [15263]     0 15263     3868     1244      11       3        0             0 dkms
[350086.338182] [15278]     0 15278     3869     1168      11       3        0             0 dkms
[350086.338184] [15611]     0 15611     3869      916      11       3        0             0 dkms
[350086.338186] [15612]     0 15612     3869      947      11       3        0             0 dkms
[350086.338189] [15613]     0 15613     8562      865      20       3        0             0 make
[350086.338191] [15619]     0 15619     8793     1078      20       3        0             0 make
[350086.338193] [15889]     0 15889     9148     1498      22       3        0             0 make
[350086.338196] [18079]     0 18079     3442      779      11       3        0             0 sh
[350086.338198] [18080]     0 18080     2490      227       9       3        0             0 cc
[350086.338201] [18081]     0 18081    99723    59927     144       3        0             0 cc1
[350086.338203] [18082]     0 18082     4786     1977      14       3        0             0 as
[350086.338205] [18091]     0 18091     3442      808      11       3        0             0 sh
[350086.338208] [18093]     0 18093     1454      165       8       3        0             0 sleep
[350086.338210] [18094]     0 18094     2490      253       9       3        0             0 cc
[350086.338211] [18095]     0 18095   100238    60442     146       3        0             0 cc1
[350086.338213] [18101]     0 18101     4786     1964      14       3        0             0 as
[350086.338215] [18104]     0 18104     3442      814      11       3        0             0 sh
[350086.338217] [18106]     0 18106     2490      248       9       3        0             0 cc
[350086.338219] [18107]     0 18107    99725    57639     141       4        0             0 cc1
[350086.338221] [18108]     0 18108     4786     2030      14       3        0             0 as
[350086.338223] [18130]     0 18130     3442      790      12       3        0             0 sh
[350086.338226] [18133]     0 18133     2490      235       8       3        0             0 cc
[350086.338228] [18134]     0 18134     3442      781      12       3        0             0 sh
[350086.338230] [18135]     0 18135    81008    48498     121       3        0             0 cc1
[350086.338232] [18136]     0 18136     4786     1935      15       3        0             0 as
[350086.338234] [18137]     0 18137     3442      786      10       3        0             0 sh
[350086.338236] [18138]     0 18138     2490      229       9       3        0             0 cc
[350086.338238] [18139]     0 18139     2490      242       9       3        0             0 cc
[350086.338240] [18140]     0 18140    80993    48202     118       3        0             0 cc1
[350086.338242] [18141]     0 18141    99713    53841     132       3        0             0 cc1
[350086.338243] [18142]     0 18142     4786     1952      14       4        0             0 as
[350086.338245] [18143]     0 18143     4786     2012      13       3        0             0 as
[350086.338247] [18152]     0 18152     3442      778      10       3        0             0 sh
[350086.338249] [18153]     0 18153     2490      226       9       3        0             0 cc
[350086.338251] [18154]     0 18154    83047    50126     121       3        0             0 cc1
[350086.338253] [18155]     0 18155     4786     2012      15       3        0             0 as
[350086.338255] [18166]     0 18166     3442      809      10       3        0             0 sh
[350086.338257] [18167]     0 18167     2490      236       9       3        0             0 cc
[350086.338259] [18168]     0 18168    73800    42887     106       4        0             0 cc1
[350086.338260] [18169]     0 18169     4786     1952      14       3        0             0 as
[350086.338262] Out of memory: Kill process 7362 (chromium) score 352 or sacrifice child
[350086.338298] Killed process 7362 (chromium) total-vm:1277096kB, anon-rss:313392kB, file-rss:68416kB, shmem-rss:27368kB
[350086.360581] oom_reaper: reaped process 7362 (chromium), now anon-rss:0kB, file-rss:0kB, shmem-rss:27268kB
[~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.5G        3.3G        810M         39M        3.4G        3.9G
Swap:            0B          0B          0B

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-18 20:26 ` Lorenzo Stoakes
@ 2016-09-18 20:58   ` Linus Torvalds
  2016-09-18 21:13     ` Vlastimil Babka
  2016-09-19  8:32   ` Michal Hocko
  1 sibling, 1 reply; 36+ messages in thread
From: Linus Torvalds @ 2016-09-18 20:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Sun, Sep 18, 2016 at 1:26 PM, Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>
> I encountered this even after applying the patch discussed in the
> original thread at https://lkml.org/lkml/2016/8/22/184.  It's not easily
> reproducible but it is happening enough that I could probably check some
> specific state when it next occurs or test out a patch to see if it
> stops it if that'd be useful.

Since you can at least try to recreate it, how about the series in -mm
by Vlastimil? The series was called "reintroduce compaction feedback
for OOM decisions", and is in -mm right now:

  Vlastimil Babka (4):
    Revert "mm, oom: prevent premature OOM killer invocation for high
order request"
    mm, compaction: more reliably increase direct compaction priority
    mm, compaction: restrict full priority to non-costly orders
    mm, compaction: make full priority ignore pageblock suitability

I'm not sure if Andrew has any other ones pending that are relevant to oom.

A lot of the oom discussion seemed to be about the task stack
allocation (order-2), but kmalloc() really can and does trigger those
order-3 allocations even for small allocations.

Just as an example, these are the slab entries for me that are order-3:

  bio-1, UDPv6, TCPv6, kcopyd_job, dm_uevent, mqueue_inode_cache,
  ext4_inode_cache, pid_namespace, PING, UDP, TCP, request_queue,
  net_namespace, bdev_cache, mm_struct, signal_cache, sighand_cache,
  task_struct, idr_layer_cache, dma-kmalloc-8192, dma-kmalloc-4096,
  dma-kmalloc-2048, dma-kmalloc-1024, kmalloc-8192, kmalloc-4096,
  kmalloc-2048, kmalloc-1024

and most of those are 1-2kB in size.

Of course, any slab allocation failure is harder to trigger just
because slab itself ends up often having empty cache entries, so only
a small percentage makes it to the page allocator itself. But the page
allocator failure case really needs to treat PAGE_ALLOC_COSTLY_ORDER
specially.

Which implies that if compaction is what makes those page allocations
succeed, then compaction needs to treat PAGE_ALLOC_COSTLY_ORDER specially too.

             Linus


* Re: More OOM problems
  2016-09-18 20:03 More OOM problems Linus Torvalds
  2016-09-18 20:26 ` Lorenzo Stoakes
@ 2016-09-18 21:00 ` Vlastimil Babka
  2016-09-18 21:18   ` Linus Torvalds
  2016-09-19  1:07   ` Andi Kleen
  2016-09-18 22:00 ` Vlastimil Babka
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Vlastimil Babka @ 2016-09-18 21:00 UTC (permalink / raw)
  To: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On 09/18/2016 10:03 PM, Linus Torvalds wrote:
> [ More or less random collection of people from previous oom patches
> and/or discussions, if you feel you shouldn't have been cc'd, blame me
> for just picking things from earlier threads and/or commits ]
> 
> I'm afraid that the oom situation is still not fixed, and the "let's
> die quickly" patches are still a nasty regression.
> 
> I have a 16GB desktop that I just noticed killed one of the chrome
> tabs yesterday. The machine had *tons* of freeable memory, with
> something like 7GB of page cache at the time, if I read this right.
> 
> The trigger is a kcalloc() in the i915 driver:
> 
>     Xorg invoked oom-killer:
> gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3,
> oom_score_adj=0
> 
>       __kmalloc+0x1cd/0x1f0
>       alloc_gen8_temp_bitmaps+0x47/0x80 [i915]
> 
> which looks like it is one of these:
> 
>   slabinfo - version: 2.1
>   # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab>
>   kmalloc-8192         268    268   8192    4    8
>   kmalloc-4096         732    786   4096    8    8
>   kmalloc-2048        1402   1456   2048   16    8
>   kmalloc-1024        2505   2976   1024   32    8
> 
> so even just a 1kB allocation can cause an order-3 page allocation.

Sounds like SLUB. SLAB would use order-0 as long as things fit. I would
hope for SLUB to fall back to order-0 (or order-1 for 8kB) instead of
OOM, though. Guess not...

> And yeah, I had what, 137MB free memory, it's just that it's all
> fairly fragmented. There's actually even order-4 pages, but they are
> in low DMA memory and the system tries to protect them:
> 
>   Node 0 DMA: 0*4kB 1*8kB (U) 2*16kB (U) 1*32kB (U) 3*64kB (U) 2*128kB
> (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
>   Node 0 DMA32: 11110*4kB (UMEH) 2929*8kB (UMEH) 44*16kB (MH) 1*32kB
> (H) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 68608kB
>   Node 0 Normal: 14031*4kB (UMEH) 49*8kB (UMEH) 18*16kB (UH) 0*32kB
> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56804kB
>   Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=1048576kB
>   Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=2048kB
>   2084682 total pagecache pages
>   11 pages in swap cache
>   Swap cache stats: add 35, delete 24, find 2/3
>   Free swap  = 8191868kB
>   Total swap = 8191996kB
>   4168499 pages RAM
> 
> And it looks like there's a fair amount of memory busy under writeback
> (470MB or so)
> 
>   active_anon:1539159 inactive_anon:374915 isolated_anon:0
>                             active_file:1251771 inactive_file:450068
> isolated_file:0
>                             unevictable:175 dirty:26 writeback:118690 unstable:0
>                             slab_reclaimable:220784 slab_unreclaimable:39819
>                             mapped:491617 shmem:382891 pagetables:20439 bounce:0
>                             free:35301 free_pcp:895 free_cma:0
> 
> And yes, CONFIG_COMPACTION was enabled.
> 
> So quite honestly, I *really* don't think that a 1kB allocation should
> have reasonably failed and killed anything at all (ok, it could have
> been an 8kB one, who knows - but it really looks like it *could* have
> been just 1kB).
> 
> Considering that kmalloc() pattern, I suspect that we need to consider
> order-3 allocations "small", and try a lot harder.

Well, order-3 is actually PAGE_ALLOC_COSTLY_ORDER, and costly orders
have to be strictly larger in all the tests. So order-3 is in fact still
considered "small", and thus it actually results in OOM instead of
allocation failure.

> Because killing processes due to "out of memory" in this situation is
> unquestionably a bug.
> 
> And no, I can't recreate this, obviously.
> 
> I think there's a series in -mm that hasn't been merged and that is
> pending (presumably for 4.9). I think Arkadiusz tested it for his
> (repeatable) workload. It may need to be considered for 4.8, because
> the above is ridiculously bad, imho.

So this series will make compaction ignore most of its heuristics
intended for reducing latency, when we keep repeating reclaim/compaction
long enough without success. This should help. But it also restores the
feedback from compaction to the retry loop (that Michal disabled for 4.8
and 4.7.x stable due to the earlier reports). So the result might not be
a clear win and that's why I hoped for more testing (thanks Arkadiusz,
though).

> Andrew? Vlastimil? Michal? Others?
> 
>             Linus
> 


* Re: More OOM problems
  2016-09-18 20:58   ` Linus Torvalds
@ 2016-09-18 21:13     ` Vlastimil Babka
  2016-09-18 21:34       ` Lorenzo Stoakes
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2016-09-18 21:13 UTC (permalink / raw)
  To: Linus Torvalds, Lorenzo Stoakes
  Cc: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On 09/18/2016 10:58 PM, Linus Torvalds wrote:
> On Sun, Sep 18, 2016 at 1:26 PM, Lorenzo Stoakes <lstoakes@gmail.com> wrote:
>>
>> I encountered this even after applying the patch discussed in the
>> original thread at https://lkml.org/lkml/2016/8/22/184.  It's not easily
>> reproducible but it is happening enough that I could probably check some
>> specific state when it next occurs or test out a patch to see if it
>> stops it if that'd be useful.
> 
> Since you can at least try to recreate it, how about the series in -mm
> by Vlastimil? The series was called "reintroduce compaction feedback
> for OOM decisions", and is in -mm right now:
> 
>   Vlastimil Babka (4):
>     Revert "mm, oom: prevent premature OOM killer invocation for high
> order request"
>     mm, compaction: more reliably increase direct compaction priority
>     mm, compaction: restrict full priority to non-costly orders
>     mm, compaction: make full priority ignore pageblock suitability
> 
> I'm not sure if Andrew has any other ones pending that are relevant to oom.

The 4 patches above have additional prerequisites already in -mm. So one
way to test is the whole tree:
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
tag mmotm-2016-09-14-16-49

or just a recent -next.




* Re: More OOM problems
  2016-09-18 21:00 ` Vlastimil Babka
@ 2016-09-18 21:18   ` Linus Torvalds
  2016-09-19  6:27     ` Jiri Slaby
  2016-09-19  7:01     ` Michal Hocko
  2016-09-19  1:07   ` Andi Kleen
  1 sibling, 2 replies; 36+ messages in thread
From: Linus Torvalds @ 2016-09-18 21:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On Sun, Sep 18, 2016 at 2:00 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Sounds like SLUB. SLAB would use order-0 as long as things fit. I would
> hope for SLUB to fallback to order-0 (or order-1 for 8kB) instead of
> OOM, though. Guess not...

SLUB it is - and I think that's pretty much all the world these days.
SLAB is largely deprecated.

We should probably start to remove SLAB entirely, and I definitely
hope that no oom people run with it. SLUB is marked default in our
config files, and I think most distros follow that (I know Fedora
does, didn't check others).

> Well, order-3 is actually PAGE_ALLOC_COSTLY_ORDER, and costly orders
> have to be strictly larger in all the tests. So order-3 is in fact still
> considered "small", and thus it actually results in OOM instead of
> allocation failure.

Yeah, but I do think that "oom when you have 156MB free and 7GB
reclaimable, and haven't even tried swapping" counts as obviously
wrong.

I'm not saying the code should fail and return NULL either, of course.

So PAGE_ALLOC_COSTLY_ORDER should *not* mean "oom rather than return
NULL". It really has to mean "try a _lot_ harder".

                 Linus


* Re: More OOM problems
  2016-09-18 21:13     ` Vlastimil Babka
@ 2016-09-18 21:34       ` Lorenzo Stoakes
  0 siblings, 0 replies; 36+ messages in thread
From: Lorenzo Stoakes @ 2016-09-18 21:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Sun, Sep 18, 2016 at 11:13:36PM +0200, Vlastimil Babka wrote:
>
> The 4 patches above have additional prerequisites already in -mm. So one
> way to test is the whole tree:
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> tag mmotm-2016-09-14-16-49
>
> or just a recent -next.
>

Thanks, I will try this out (probably using a recent -next).


* Re: More OOM problems
  2016-09-18 20:03 More OOM problems Linus Torvalds
  2016-09-18 20:26 ` Lorenzo Stoakes
  2016-09-18 21:00 ` Vlastimil Babka
@ 2016-09-18 22:00 ` Vlastimil Babka
  2016-09-19  6:56   ` Michal Hocko
  2016-09-19  6:48 ` Michal Hocko
  2016-09-21  7:04 ` Raymond Jennings
  4 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2016-09-18 22:00 UTC (permalink / raw)
  To: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov
  Cc: Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On 09/18/2016 10:03 PM, Linus Torvalds wrote:
> [ More or less random collection of people from previous oom patches 
> and/or discussions, if you feel you shouldn't have been cc'd, blame
> me for just picking things from earlier threads and/or commits ]
> 
> I'm afraid that the oom situation is still not fixed, and the "let's 
> die quickly" patches are still a nasty regression.

So I'm trying to understand the core of the regression compared to
pre-4.7. It can't be the compaction feedback, as that was reverted, and
compaction itself shouldn't perform worse than pre-4.7. This leaves us
with should_reclaim_retry() returning false, which can happen if:

1) no_progress_loops > MAX_RECLAIM_RETRIES

But we have this in __alloc_pages_slowpath():

	if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
		no_progress_loops = 0;

I doubt reclaim made no progress in your case, and the order is
non-costly, so no_progress_loops keeps getting reset. Thus, unlikely.

2) The watermark check that includes an estimate of the pages available
for reclaim fails.

It could be that the backoff in the calculation of "available" in
should_reclaim_retry() is too aggressive. But that depends on
no_progress_loops, which I think is 0 here (see above). Again, unlikely.

But the watermark check doesn't actually work for order-1+ allocations;
the "available" estimate only affects the order-0 check. For higher orders
it will be false if a page of sufficient order doesn't already exist.
That's fine if we trust should_compact_retry() in such cases.

But Joonsoo already had a theoretical scenario where this can fall apart:
http://lkml.kernel.org/r/20160824050157.GA22781@js1304-P5Q-DELUXE

See the part that starts at "Assume following situation:". I suspect
something like that happened here.

I think, at least temporarily, we'll have to make the watermark check
an order-0 check for non-costly orders.

Something like below (untested)?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a2214c64ed3c..9b3b3a79c58a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3347,17 +3347,24 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		int check_order = order;
+		unsigned long watermark = min_wmark_pages(zone);
 
 		available = reclaimable = zone_reclaimable_pages(zone);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 
+		if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER) {
+			check_order = 0;
+			watermark += 1UL << order;
+		}
+
 		/*
 		 * Would the allocation succeed if we reclaimed the whole
 		 * available?
 		 */
-		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+		if (__zone_watermark_ok(zone, check_order, watermark,
 				ac_classzone_idx(ac), alloc_flags, available)) {
 			/*
 			 * If we didn't make any progress and have a lot of


* Re: More OOM problems
  2016-09-18 21:00 ` Vlastimil Babka
  2016-09-18 21:18   ` Linus Torvalds
@ 2016-09-19  1:07   ` Andi Kleen
       [not found]     ` <alpine.DEB.2.20.1609190836540.12121@east.gentwo.org>
  1 sibling, 1 reply; 36+ messages in thread
From: Andi Kleen @ 2016-09-19  1:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm, cl

Vlastimil Babka <vbabka@suse.cz> writes:
>> 
>> The trigger is a kcalloc() in the i915 driver:
>> 
>>     Xorg invoked oom-killer:
>> gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3,
>> oom_score_adj=0
>> 
>>       __kmalloc+0x1cd/0x1f0
>>       alloc_gen8_temp_bitmaps+0x47/0x80 [i915]
>> 
>> which looks like it is one of these:
>> 
>>   slabinfo - version: 2.1
>>   # name            <active_objs> <num_objs> <objsize> <objperslab>
>> <pagesperslab>
>>   kmalloc-8192         268    268   8192    4    8
>>   kmalloc-4096         732    786   4096    8    8
>>   kmalloc-2048        1402   1456   2048   16    8
>>   kmalloc-1024        2505   2976   1024   32    8
>> 
>> so even just a 1kB allocation can cause an order-3 page allocation.
>
> Sounds like SLUB. SLAB would use order-0 as long as things fit. I would
> hope for SLUB to fallback to order-0 (or order-1 for 8kB) instead of
> OOM, though. Guess not...

It's already trying to do that, perhaps just some flags need to be
changed?

Adding Christoph.

	flags |= s->allocflags;

	/*
	 * Let the initial higher-order allocation fail under memory pressure
	 * so we fall-back to the minimum order allocation.
	 */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;

	page = alloc_slab_page(alloc_gfp, node, oo);
	if (unlikely(!page)) {
		oo = s->min;
		/*
		 * Allocation may have failed due to fragmentation.
		 * Try a lower order alloc if possible
		 */
		page = alloc_slab_page(flags, node, oo);

		if (page)
			stat(s, ORDER_FALLBACK);
	}


-Andi


* Re: More OOM problems
  2016-09-18 21:18   ` Linus Torvalds
@ 2016-09-19  6:27     ` Jiri Slaby
  2016-09-19  7:01     ` Michal Hocko
  1 sibling, 0 replies; 36+ messages in thread
From: Jiri Slaby @ 2016-09-19  6:27 UTC (permalink / raw)
  To: Linus Torvalds, Vlastimil Babka
  Cc: Olaf Hering, Arkadiusz Miskiewicz, Joonsoo Kim, Tetsuo Handa,
	Michal Hocko, linux-mm, Andrew Morton, Vladimir Davydov,
	Ralf-Peter Rohbeck, Oleg Nesterov, Markus Trippelsdorf

On 09/18/2016, 11:18 PM, Linus Torvalds wrote:
> SLUB is marked default in our
> config files, and I think most distros follow that (I know Fedora
> does, didn't check others).

For reference, all active SUSE kernels use SLAB.

thanks,
-- 
js
suse labs


* Re: More OOM problems
  2016-09-18 20:03 More OOM problems Linus Torvalds
                   ` (2 preceding siblings ...)
  2016-09-18 22:00 ` Vlastimil Babka
@ 2016-09-19  6:48 ` Michal Hocko
  2016-09-21  7:04 ` Raymond Jennings
  4 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-09-19  6:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Tetsuo Handa, Oleg Nesterov, Vladimir Davydov, Vlastimil Babka,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On Sun 18-09-16 13:03:01, Linus Torvalds wrote:
> [ More or less random collection of people from previous oom patches
> and/or discussions, if you feel you shouldn't have been cc'd, blame me
> for just picking things from earlier threads and/or commits ]
> 
> I'm afraid that the oom situation is still not fixed, and the "let's
> die quickly" patches are still a nasty regression.
> 
> I have a 16GB desktop that I just noticed killed one of the chrome
> tabs yesterday. The machine had *tons* of freeable memory, with
> something like 7GB of page cache at the time, if I read this right.
> 
> The trigger is a kcalloc() in the i915 driver:
> 
>     Xorg invoked oom-killer:
> gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3,
> oom_score_adj=0
> 
>       __kmalloc+0x1cd/0x1f0
>       alloc_gen8_temp_bitmaps+0x47/0x80 [i915]
> 
> which looks like it is one of these:
> 
>   slabinfo - version: 2.1
>   # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab>
>   kmalloc-8192         268    268   8192    4    8
>   kmalloc-4096         732    786   4096    8    8
>   kmalloc-2048        1402   1456   2048   16    8
>   kmalloc-1024        2505   2976   1024   32    8
> 
> so even just a 1kB allocation can cause an order-3 page allocation.

Yes, it can trigger order-3, but that path should use

	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL

so it fails early rather than retrying really hard, and does not trigger
the OOM killer. Considering the above gfp_mask (no __GFP_NORETRY), this
seems to be a genuine order-3-sized request.

> And yeah, I had what, 137MB free memory, it's just that it's all
> fairly fragmented.

137MB in your case means that all usable zones are not meeting the min
watermark, so 6b4e3181d7bd ("mm, oom: prevent premature OOM killer
invocation for high order request") didn't stop the OOM.

[...]

> So quite honestly, I *really* don't think that a 1kB allocation should
> have reasonably failed and killed anything at all (ok, it could have
> been an 8kB one, who knows - but it really looks like it *could* have
> been just 1kB).

Unless I am missing something, this really is a 32kB request. It is
true that retrying somewhat (or much) harder might help here; that is
really hard to tell. Vlastimil's patches you have mentioned might really
help here, because they get rid of most of the heuristics that would
give up just too early. But I am also wondering whether a more
pragmatic approach in this case would be to simply use __GFP_NORETRY and
fall back to vmalloc. Note that I am not familiar with the code and
vmalloc might be a no-go, but it is at least worth exploring this option.

-- 
Michal Hocko
SUSE Labs


* Re: More OOM problems
  2016-09-18 22:00 ` Vlastimil Babka
@ 2016-09-19  6:56   ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-09-19  6:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On Mon 19-09-16 00:00:24, Vlastimil Babka wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a2214c64ed3c..9b3b3a79c58a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3347,17 +3347,24 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  					ac->nodemask) {
>  		unsigned long available;
>  		unsigned long reclaimable;
> +		int check_order = order;
> +		unsigned long watermark = min_wmark_pages(zone);
>  
>  		available = reclaimable = zone_reclaimable_pages(zone);
>  		available -= DIV_ROUND_UP(no_progress_loops * available,
>  					  MAX_RECLAIM_RETRIES);
>  		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  
> +		if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER) {
> +			check_order = 0;
> +			watermark += 1UL << order;
> +		}
> +
>  		/*
>  		 * Would the allocation succeed if we reclaimed the whole
>  		 * available?
>  		 */
> -		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +		if (__zone_watermark_ok(zone, check_order, watermark,
>  				ac_classzone_idx(ac), alloc_flags, available)) {
>  			/*
>  			 * If we didn't make any progress and have a lot of

Joonsoo was suggesting something like this before and I really hated it. We
could just as well not invoke the OOM killer for those requests at all and
rely on a smaller order request to trigger it for us. But who knows, maybe
we will have no other option than to bite the bullet, declare defeat, and do
something special for !costly orders.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-18 21:18   ` Linus Torvalds
  2016-09-19  6:27     ` Jiri Slaby
@ 2016-09-19  7:01     ` Michal Hocko
  2016-09-19  7:52       ` Michal Hocko
  1 sibling, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-09-19  7:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vlastimil Babka, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On Sun 18-09-16 14:18:22, Linus Torvalds wrote:
> On Sun, Sep 18, 2016 at 2:00 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > Sounds like SLUB. SLAB would use order-0 as long as things fit. I would
> > hope for SLUB to fallback to order-0 (or order-1 for 8kB) instead of
> > OOM, though. Guess not...
> 
> SLUB it is - and I think that's pretty much all the world these days.
> SLAB is largely deprecated.

It seems there is no general consensus on that
http://lkml.kernel.org/r/20160823153807.GN23577@dhcp22.suse.cz

> We should probably start to remove SLAB entirely, and I definitely
> hope that no oom people run with it. SLUB is marked default in our
> config files, and I think most distros follow that (I know Fedora
> does, didn't check others).
> 
> > Well, order-3 is actually PAGE_ALLOC_COSTLY_ORDER, and costly orders
> > have to be strictly larger in all the tests. So order-3 is in fact still
> > considered "small", and thus it actually results in OOM instead of
> > allocation failure.
> 
> Yeah, but I do think that "oom when you have 156MB free and 7GB
> reclaimable, and haven't even tried swapping" counts as obviously
> wrong.

The thing is that swapping doesn't really help here. You can easily migrate
anonymous pages to create larger blocks even without reclaiming them.
So I still believe compaction is giving up too easily.

> I'm not saying the code should fail and return NULL either, of course.
> 
> So  PAGE_ALLOC_COSTLY_ORDER should *not* mean "oom rather than return
> NULL". It really has to mean "try a _lot_ harder".

Agreed and Vlastimil's patches go that route. We just do not try
sufficiently hard with the compaction.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19  7:01     ` Michal Hocko
@ 2016-09-19  7:52       ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-09-19  7:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vlastimil Babka, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On Mon 19-09-16 09:01:06, Michal Hocko wrote:
> On Sun 18-09-16 14:18:22, Linus Torvalds wrote:
[...]
> > I'm not saying the code should fail and return NULL either, of course.
> > 
> > So  PAGE_ALLOC_COSTLY_ORDER should *not* mean "oom rather than return
> > NULL". It really has to mean "try a _lot_ harder".
> 
> Agreed and Vlastimil's patches go that route. We just do not try
> sufficiently hard with the compaction.

And just to clarify why I think that Vlastimil's patches might help
here. Your allocation fails because you seem to be hitting the min watermark
even for order-0 with my workaround which is sitting in 4.8. If this is a
longer-term state then compaction doesn't even try to do anything.
With the original should_compact_retry() we would keep retrying based on the
compaction_withdrawn() feedback. That would get us over the order-0
watermarks and kick compaction in. Without Vlastimil's patches we could
still give up too early due to some of the back-off heuristics in the
compaction code. But most of those should be gone with his patches. So I
believe that they should really help here. Maybe there are still some
places to look at - I didn't get to fully review his patches (plan to do
it this week).

So in short, the workaround we currently have in 4.8 tries to plug the
biggest hole even though the situation is not ideal. That's why I originally
hoped to have the compaction feedback in 4.8 already.

I fully realize this is a lot of code for late 4.8 cycle, though. So if
this turns out to be really critical for 4.8 then what Vlastimil was
suggesting in
http://lkml.kernel.org/r/6aa81fe3-7f04-78d7-d477-609a7acd351a@suse.cz
might be another workaround on top. We could even consider completely
disabling the OOM killer for !costly orders.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-18 20:26 ` Lorenzo Stoakes
  2016-09-18 20:58   ` Linus Torvalds
@ 2016-09-19  8:32   ` Michal Hocko
  2016-09-19  8:42     ` Lorenzo Stoakes
  1 sibling, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-09-19  8:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Sun 18-09-16 21:26:14, Lorenzo Stoakes wrote:
> Hi all,
> 
> In case it's helpful - I have experienced these OOM issues invoked
> in my case via the nvidia driver and similarly to Linus an order
> 3 allocation resulted in killed chromium tabs. I encountered this
> even after applying the patch discussed in the original thread at
> https://lkml.org/lkml/2016/8/22/184. It's not easily reproducible
> but it is happening enough that I could probably check some specific
> state when it next occurs or test out a patch to see if it stops it if
> that'd be useful.
>
> I saved a couple OOM's from the last time it occurred, this is on a
> 8GiB system with plenty of reclaimable memory:

Just for the reference
 
> [350085.038693] Xorg invoked oom-killer: gfp_mask=0x24040c0(GFP_KERNEL|__GFP_COMP), order=3, oom_score_adj=0
> [350085.038696] Xorg cpuset=/ mems_allowed=0
> [350085.038699] CPU: 0 PID: 2119 Comm: Xorg Tainted: P           O    4.7.2-1-custom #1
[...]
> [350085.039048] Mem-Info:
> [350085.039051] active_anon:861397 inactive_anon:23397 isolated_anon:0
>                  active_file:146274 inactive_file:144248 isolated_file:0
>                  unevictable:8 dirty:14587 writeback:0 unstable:0
>                  slab_reclaimable:697630 slab_unreclaimable:24397
>                  mapped:79655 shmem:26548 pagetables:7211 bounce:0
>                  free:25159 free_pcp:235 free_cma:0
> [350085.039054] Node 0 DMA free:15516kB min:136kB low:168kB high:200kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> [350085.039058] lowmem_reserve[]: 0 3196 7658 7658
> [350085.039060] Node 0 DMA32 free:45980kB min:28148kB low:35184kB high:42220kB active_anon:1466208kB inactive_anon:43120kB active_file:239740kB inactive_file:234920kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3617864kB managed:3280092kB mlocked:0kB dirty:21692kB writeback:0kB mapped:131184kB shmem:47588kB slab_reclaimable:1147984kB slab_unreclaimable:37484kB kernel_stack:2976kB pagetables:11512kB unstable:0kB bounce:0kB free_pcp:188kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [350085.039064] lowmem_reserve[]: 0 0 4462 4462

45980kB - (4462 pages * 4kB) = 28132kB, i.e. below the 28148kB min
watermark of the Normal zone once the lowmem reserve is subtracted

> [350085.039065] Node 0 Normal free:39140kB min:39296kB low:49120kB high:58944kB active_anon:1979380kB inactive_anon:50468kB active_file:345356kB inactive_file:342072kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:4702208kB managed:4569312kB mlocked:32kB dirty:36656kB writeback:0kB mapped:187436kB shmem:58604kB slab_reclaimable:1642536kB slab_unreclaimable:60104kB kernel_stack:5040kB pagetables:17332kB unstable:0kB bounce:0kB free_pcp:752kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:136 all_unreclaimable? no

so this is the same thing as in Linus' case. All the zones are hitting the
min wmark, so should_compact_retry() gave up. As mentioned in the other
email [1], this is an inherent limitation of the workaround. Your system is
swapless, but there is a lot of reclaimable page cache, so Vlastimil's
patches should help.

[1] http://lkml.kernel.org/r/20160919075230.GE10785@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19  8:32   ` Michal Hocko
@ 2016-09-19  8:42     ` Lorenzo Stoakes
  2016-09-19  8:53       ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2016-09-19  8:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Mon, Sep 19, 2016 at 10:32:15AM +0200, Michal Hocko wrote:
>
> so this is the same thing as in Linus case. All the zones are hitting
> min wmark so the should_compact_retry() gave up. As mentioned in other
> email [1] this is inherent limitation of the workaround. Your system is
> swapless but there is a lot of the reclaimable page cache so Vlastimil's
> patches should help.

I will experiment with a linux-next kernel and see if the problem recurs. I've attempted to see if there is a way to manually reproduce on the mainline kernel by performing workloads that triggered the OOM (loading google sheets tabs, compiling a kernel, playing a video on youtube), but to no avail - it seems the system needs to be sufficiently fragmented first before it'll trigger.

Given that's the case, I'll just have to try using the linux-next kernel and if you don't hear from me you can assume it did not repro again :)

I actually have a whole bunch of other OOM kill logs that I saved from previous occurrences of this issue, would it be useful for me to pastebin them, or would they not add anything of use beyond what's been shown in this thread?


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19  8:42     ` Lorenzo Stoakes
@ 2016-09-19  8:53       ` Michal Hocko
  2016-09-25 21:48         ` Lorenzo Stoakes
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-09-19  8:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Mon 19-09-16 09:42:37, Lorenzo Stoakes wrote:
> On Mon, Sep 19, 2016 at 10:32:15AM +0200, Michal Hocko wrote:
> >
> > so this is the same thing as in Linus case. All the zones are hitting
> > min wmark so the should_compact_retry() gave up. As mentioned in other
> > email [1] this is inherent limitation of the workaround. Your system is
> > swapless but there is a lot of the reclaimable page cache so Vlastimil's
> > patches should help.
> 
> I will experiment with a linux-next kernel and see if the problem
> recurs. I've attempted to see if there is a way to manually reproduce
> on the mainline kernel by performing workloads that triggered the
> OOM (loading google sheets tabs, compiling a kernel, playing a video
> on youtube), but to no avail - it seems the system needs to be
> sufficiently fragmented first before it'll trigger.
>
> Given that's the case, I'll just have to try using the linux-next
> kernel and if you don't hear from me you can assume it did not repro
> again :)

OK, fair deal ;)

> I actually have a whole bunch of other OOM kill logs that I saved
> from previous occurrences of this issue, would it be useful for me to
> pastebin them, or would they not add anything of use beyond what's
> been shown in this thread?

If they are from before the workaround then they probably won't be that
useful.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
       [not found]     ` <alpine.DEB.2.20.1609190836540.12121@east.gentwo.org>
@ 2016-09-19 14:31       ` Andi Kleen
  2016-09-19 14:39         ` Michal Hocko
  2016-09-19 14:41         ` Vlastimil Babka
  0 siblings, 2 replies; 36+ messages in thread
From: Andi Kleen @ 2016-09-19 14:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Vlastimil Babka, Linus Torvalds, Michal Hocko,
	Tetsuo Handa, Oleg Nesterov, Vladimir Davydov, Andrew Morton,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Joonsoo Kim, linux-mm

On Mon, Sep 19, 2016 at 08:37:36AM -0500, Christoph Lameter wrote:
> On Sun, 18 Sep 2016, Andi Kleen wrote:
> 
> > > Sounds like SLUB. SLAB would use order-0 as long as things fit. I would
> > > hope for SLUB to fallback to order-0 (or order-1 for 8kB) instead of
> > > OOM, though. Guess not...
> >
> > It's already trying to do that, perhaps just some flags need to be
> > changed?
> 
> SLUB tries order-N and falls back to order 0 on failure.

Right, it tries, but Linus apparently got an OOM in the order-N
allocation. So somehow the flag combination that it passes first
is not preventing the OOM killer.

-Andi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19 14:31       ` Andi Kleen
@ 2016-09-19 14:39         ` Michal Hocko
  2016-09-19 14:41         ` Vlastimil Babka
  1 sibling, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-09-19 14:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Vlastimil Babka, Linus Torvalds, Tetsuo Handa,
	Oleg Nesterov, Vladimir Davydov, Andrew Morton,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Joonsoo Kim, linux-mm

On Mon 19-09-16 07:31:06, Andi Kleen wrote:
> On Mon, Sep 19, 2016 at 08:37:36AM -0500, Christoph Lameter wrote:
> > On Sun, 18 Sep 2016, Andi Kleen wrote:
> > 
> > > > Sounds like SLUB. SLAB would use order-0 as long as things fit. I would
> > > > hope for SLUB to fallback to order-0 (or order-1 for 8kB) instead of
> > > > OOM, though. Guess not...
> > >
> > > It's already trying to do that, perhaps just some flags need to be
> > > changed?
> > 
> > SLUB tries order-N and falls back to order 0 on failure.
> 
> Right it tries, but Linus apparently got an OOM in the order-N
> allocation. So somehow the flag combination that it passes first
> is not preventing the OOM killer.

It does AFAICS:
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~(__GFP_RECLAIM|__GFP_NOFAIL);

	page = alloc_slab_page(s, alloc_gfp, node, oo);
	if (unlikely(!page)) {
		oo = s->min;
		alloc_gfp = flags;
		/*
		 * Allocation may have failed due to fragmentation.
		 * Try a lower order alloc if possible
		 */
		page = alloc_slab_page(s, alloc_gfp, node, oo);

I think that Linus just saw a genuine order-3 request.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19 14:31       ` Andi Kleen
  2016-09-19 14:39         ` Michal Hocko
@ 2016-09-19 14:41         ` Vlastimil Babka
  2016-09-19 18:18           ` Linus Torvalds
  2016-09-19 19:57           ` Christoph Lameter
  1 sibling, 2 replies; 36+ messages in thread
From: Vlastimil Babka @ 2016-09-19 14:41 UTC (permalink / raw)
  To: Andi Kleen, Christoph Lameter
  Cc: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On 09/19/2016 04:31 PM, Andi Kleen wrote:
> On Mon, Sep 19, 2016 at 08:37:36AM -0500, Christoph Lameter wrote:
>> On Sun, 18 Sep 2016, Andi Kleen wrote:
>>
>>>> Sounds like SLUB. SLAB would use order-0 as long as things fit. I would
>>>> hope for SLUB to fallback to order-0 (or order-1 for 8kB) instead of
>>>> OOM, though. Guess not...
>>>
>>> It's already trying to do that, perhaps just some flags need to be
>>> changed?
>>
>> SLUB tries order-N and falls back to order 0 on failure.
>
> Right it tries, but Linus apparently got an OOM in the order-N
> allocation. So somehow the flag combination that it passes first
> is not preventing the OOM killer.

But Linus' error was:

    Xorg invoked oom-killer:
gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3,
oom_score_adj=0

There's no __GFP_NOWARN | __GFP_NORETRY, so it clearly wasn't the 
opportunistic "initial higher-order allocation". The logical conclusion 
is that it was a genuine order-3 allocation. A 1kB allocation using
order-3 would silently fail without OOM or warning, and then fall back to
order-0.

> -Andi
>


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19 14:41         ` Vlastimil Babka
@ 2016-09-19 18:18           ` Linus Torvalds
  2016-09-19 19:57           ` Christoph Lameter
  1 sibling, 0 replies; 36+ messages in thread
From: Linus Torvalds @ 2016-09-19 18:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andi Kleen, Christoph Lameter, Michal Hocko, Tetsuo Handa,
	Oleg Nesterov, Vladimir Davydov, Andrew Morton,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Joonsoo Kim, linux-mm

On Mon, Sep 19, 2016 at 7:41 AM, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> There's no __GFP_NOWARN | __GFP_NORETRY, so it clearly wasn't the
> opportunistic "initial higher-order allocation". The logical conclusion is
> that it was a genuine order-3 allocation. 1kB allocation using order-3 would
> silently fail without OOM or warning, and then fallback to order-0.

Yes, I think you're right. The kcalloc() probably *was* a 32kB
allocation. In which case it's really more of a i915 driver issue.
I'll talk to the drm people and see if they can perhaps fix their
allocation patterns.

               Linus


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19 14:41         ` Vlastimil Babka
  2016-09-19 18:18           ` Linus Torvalds
@ 2016-09-19 19:57           ` Christoph Lameter
  1 sibling, 0 replies; 36+ messages in thread
From: Christoph Lameter @ 2016-09-19 19:57 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andi Kleen, Linus Torvalds, Michal Hocko, Tetsuo Handa,
	Oleg Nesterov, Vladimir Davydov, Andrew Morton,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Ralf-Peter Rohbeck,
	Jiri Slaby, Olaf Hering, Joonsoo Kim, linux-mm

On Mon, 19 Sep 2016, Vlastimil Babka wrote:

> There's no __GFP_NOWARN | __GFP_NORETRY, so it clearly wasn't the
> opportunistic "initial higher-order allocation". The logical conclusion is
> that it was a genuine order-3 allocation. 1kB allocation using order-3 would
> silently fail without OOM or warning, and then fallback to order-0.

Sorry, but if you really want an object that is larger than page size, then
the slab allocators won't be able to satisfy that with an order-0 allocation.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-18 20:03 More OOM problems Linus Torvalds
                   ` (3 preceding siblings ...)
  2016-09-19  6:48 ` Michal Hocko
@ 2016-09-21  7:04 ` Raymond Jennings
  2016-09-21  7:29   ` Michal Hocko
  2016-09-29  6:12   ` More OOM problems (sorry fro the mail bomb) Raymond Jennings
  4 siblings, 2 replies; 36+ messages in thread
From: Raymond Jennings @ 2016-09-21  7:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Sun, 18 Sep 2016 13:03:01 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> [ More or less random collection of people from previous oom patches
> and/or discussions, if you feel you shouldn't have been cc'd, blame me
> for just picking things from earlier threads and/or commits ]
> 
> I'm afraid that the oom situation is still not fixed, and the "let's
> die quickly" patches are still a nasty regression.
> 
> I have a 16GB desktop that I just noticed killed one of the chrome
> tabs yesterday. Tha machine had *tons* of freeable memory, with
> something like 7GB of page cache at the time, if I read this right.

Suggestions:

* Live compaction?

Have a background process that actively defragments free memory by
bubbling movable pages to one end of the zone and the free holes to the
other end?

Same spirit perhaps as khugepaged, periodically walk a zone from one
end and migrate any used movable pages into the hole closest to the
other end?

I dunno, doing this manually with /proc/sys/vm/compact_blah seems a
little hamfisted to me, and maybe a background process doing it
incrementally would be better?

Also, question (for myself but also for the curious):

If you're allocating memory, can you synchronously reclaim, or does the
memory have to be free already?  I have a hunch that if you get caught
with freeable memory that's still being used as clean pagecache, you
should be able to free it immediately if memory is scarce...but then
again it might choke because a process in userland could always touch
it through vfs or something like that.

> The trigger is a kcalloc() in the i915 driver:
> 
>     Xorg invoked oom-killer:
> gfp_mask=0x240c0d0(GFP_TEMPORARY|__GFP_COMP|__GFP_ZERO), order=3,
> oom_score_adj=0
> 
>       __kmalloc+0x1cd/0x1f0
>       alloc_gen8_temp_bitmaps+0x47/0x80 [i915]
> 
> which looks like it is one of these:
> 
>   slabinfo - version: 2.1
>   # name            <active_objs> <num_objs> <objsize> <objperslab>
> <pagesperslab>
>   kmalloc-8192         268    268   8192    4    8
>   kmalloc-4096         732    786   4096    8    8
>   kmalloc-2048        1402   1456   2048   16    8
>   kmalloc-1024        2505   2976   1024   32    8
> 
> so even just a 1kB allocation can cause an order-3 page allocation.
> 
> And yeah, I had what, 137MB free memory, it's just that it's all
> fairly fragmented. There's actually even order-4 pages, but they are
> in low DMA memory and the system tries to protect them:
> 
>   Node 0 DMA: 0*4kB 1*8kB (U) 2*16kB (U) 1*32kB (U) 3*64kB (U) 2*128kB
> (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
>   Node 0 DMA32: 11110*4kB (UMEH) 2929*8kB (UMEH) 44*16kB (MH) 1*32kB
> (H) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB =
> 68608kB
>   Node 0 Normal: 14031*4kB (UMEH) 49*8kB (UMEH) 18*16kB (UH) 0*32kB
> 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 56804kB
>   Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=1048576kB
>   Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0
> hugepages_size=2048kB
>   2084682 total pagecache pages
>   11 pages in swap cache
>   Swap cache stats: add 35, delete 24, find 2/3
>   Free swap  = 8191868kB
>   Total swap = 8191996kB
>   4168499 pages RAM
> 
> And it looks like there's a fair amount of memory busy under writeback
> (470MB or so)
> 
>   active_anon:1539159 inactive_anon:374915 isolated_anon:0
>                             active_file:1251771 inactive_file:450068
> isolated_file:0
>                             unevictable:175 dirty:26 writeback:118690
> unstable:0 slab_reclaimable:220784 slab_unreclaimable:39819
>                             mapped:491617 shmem:382891
> pagetables:20439 bounce:0 free:35301 free_pcp:895 free_cma:0
> 
> And yes, CONFIG_COMPACTION was enabled.

Does this compact manually or automatically?

> So quite honestly, I *really* don't think that a 1kB allocation should
> have reasonably failed and killed anything at all (ok, it could have
> been an 8kB one, who knows - but it really looks like it *could* have
> been just 1kB).
> 
> Considering that kmalloc() pattern, I suspect that we need to consider
> order-3 allocations "small", and try a lot harder.
> 
> Because killing processes due to "out of memory" in this situation is
> unquestionably a bug.

In this case I'd wonder why the freeable-but-still-used-in-pagecache
memory isn't being reaped at alloc time.

> And no, I can't recreate this, obviously.
> 
> I think there's a series in -mm that hasn't been merged and that is
> pending (presumably for 4.9). I think Arkadiusz tested it for his
> (repeatable) workload. It may need to be considered for 4.8, because
> the above is ridiculously bad, imho.
> 
> Andrew? Vlastimil? Michal? Others?
> 
>             Linus
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-21  7:04 ` Raymond Jennings
@ 2016-09-21  7:29   ` Michal Hocko
  2016-09-29  6:12   ` More OOM problems (sorry fro the mail bomb) Raymond Jennings
  1 sibling, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-09-21  7:29 UTC (permalink / raw)
  To: Raymond Jennings
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Wed 21-09-16 00:04:58, Raymond Jennings wrote:
> On Sun, 18 Sep 2016 13:03:01 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > [ More or less random collection of people from previous oom patches
> > and/or discussions, if you feel you shouldn't have been cc'd, blame me
> > for just picking things from earlier threads and/or commits ]
> > 
> > I'm afraid that the oom situation is still not fixed, and the "let's
> > die quickly" patches are still a nasty regression.
> > 
> > I have a 16GB desktop that I just noticed killed one of the chrome
> > tabs yesterday. Tha machine had *tons* of freeable memory, with
> > something like 7GB of page cache at the time, if I read this right.
> 
> Suggestions:
> 
> * Live compaction?
> 
> Have a background process that actively defragments free memory by
> bubbling movable pages to one end of the zone and the free holes to the
> other end?
> 
> Same spirit perhaps as khugepaged, periodically walk a zone from one
> end and migrate any used movable pages into the hole closest to the
> other end?

we have something like that already. It's called kcompactd

> I dunno, doing this manually with /proc/sys/vm/compact_blah seems a
> little hamfisted to me, and maybe a background process doing it
> incrementally would be better?
> 
> Also, question (for myself but also for the curious):
> 
> If you're allocating memory, can you synchronously reclaim, or does the
> memory have to be free already?

Yes we do direct reclaim if we are hitting watermarks. kswapd will start
earlier to prevent from direct reclaim because that will incur
latencies.

[...]
> > And yes, CONFIG_COMPACTION was enabled.
> 
> Does this compact manually or automatically?

Without this option there is no compaction at all, and reclaim is the
only source of high-order pages.

> > So quite honestly, I *really* don't think that a 1kB allocation should
> > have reasonably failed and killed anything at all (ok, it could have
> > been an 8kB one, who knows - but it really looks like it *could* have
> > been just 1kB).
> > 
> > Considering that kmalloc() pattern, I suspect that we need to consider
> > order-3 allocations "small", and try a lot harder.
> > 
> > Because killing processes due to "out of memory" in this situation is
> > unquestionably a bug.
> 
> In this case I'd wonder why the freeable-but-still-used-in-pagecache
> memory isn't being reaped at alloc time.

I've tried to explain in another email, but let me try again. The
compaction code will back off and refrain from doing anything if we are
close to the watermarks. That was your case, as I pointed out in the
other email. The workaround sitting in Linus' tree (retry as long as we
are above the order-0 watermark) will prevent high-order OOMs only if
there is some memory left, which should normally be the case because
reclaim should free up something; but if a parallel allocation hits
during the reclaim, somebody might have eaten up that memory. That's
why I've said it's far from ideal, but it should at least plug the
biggest hole.
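A toy rendering of that stop-gap check (names are invented for
illustration; the real logic lives in should_compact_retry()):

```python
# Model of the workaround: a high-order allocation keeps retrying
# (reclaim + compaction) as long as order-0 memory is still above the
# watermark, because compaction could still assemble a contiguous
# block from it. Only when even base pages are gone do we give up and
# take the OOM path - which is exactly the race described above: a
# parallel allocator can consume that memory during reclaim.

def should_retry_high_order(order, free_order0_pages, order0_wmark):
    if order == 0:
        return False    # order-0 failures follow the normal OOM path
    return free_order0_pages > order0_wmark
```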

The patches from Vlastimil get us back to the compaction feedback loop,
which was my original design. That means we keep reclaiming while the
compaction backs off, and keep retrying as long as the compaction
doesn't fail. His changes drop some heuristics when we are getting
close to an OOM situation, so it should work much more reliably than my
original implementation. He didn't have to change the detection code,
only compaction implementation details.
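That feedback loop can be sketched like this (illustrative only; the
callbacks and the loop bound are invented stand-ins for the real
allocator slow path):

```python
# Keep alternating reclaim and compaction, and only fall through to
# the OOM path once compaction itself reports that it cannot make
# progress - rather than giving up on a watermark heuristic.

def try_high_order_alloc(attempt_alloc, reclaim, compact_status, max_loops=16):
    for _ in range(max_loops):
        if attempt_alloc():
            return True
        reclaim()                        # reclaim even while compaction backs off
        if compact_status() == "failed":
            return False                 # compaction truly stuck -> OOM path
    return False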

HTH
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <email@kvack.org>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: More OOM problems
  2016-09-19  8:53       ` Michal Hocko
@ 2016-09-25 21:48         ` Lorenzo Stoakes
  2016-09-26  7:48           ` Michal Hocko
  0 siblings, 1 reply; 36+ messages in thread
From: Lorenzo Stoakes @ 2016-09-25 21:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Mon, Sep 19, 2016 at 10:53:48AM +0200, Michal Hocko wrote:
> On Mon 19-09-16 09:42:37, Lorenzo Stoakes wrote:
> > On Mon, Sep 19, 2016 at 10:32:15AM +0200, Michal Hocko wrote:
> > >
> > > so this is the same thing as in Linus case. All the zones are hitting
> > > min wmark so the should_compact_retry() gave up. As mentioned in other
> > > email [1] this is inherent limitation of the workaround. Your system is
> > > swapless but there is a lot of the reclaimable page cache so Vlastimil's
> > > patches should help.
> >
> > I will experiment with a linux-next kernel and see if the problem
> > recurs. I've attempted to see if there is a way to manually reproduce
> > on the mainline kernel by performing workloads that triggered the
> > OOM (loading google sheets tabs, compiling a kernel, playing a video
> > on youtube), but to no avail - it seems the system needs to be
> > sufficiently fragmented first before it'll trigger.
> >
> > Given that's the case, I'll just have to try using the linux-next
> > kernel and if you don't hear from me you can assume it did not repro
> > again :)
>
> OK, fair deal ;)

Actually, I'll break the deal :) I've been running workloads similar to
previous weeks when I encountered the issue - including kernel builds,
video playing, lotsa tabs, etc. - and also tried to intentionally eat
up a bit of RAM from time to time, and I have not seen a single OOM, so
it looks like this is sorted for my system, notwithstanding Murphy's
law.

(I ended up using the mm tree as irritatingly I couldn't get linux-next working
with the arch linux build system, but it definitely includes Vlastimil's
patches.)


* Re: More OOM problems
  2016-09-25 21:48         ` Lorenzo Stoakes
@ 2016-09-26  7:48           ` Michal Hocko
  0 siblings, 0 replies; 36+ messages in thread
From: Michal Hocko @ 2016-09-26  7:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Sun 25-09-16 22:48:23, Lorenzo Stoakes wrote:
> On Mon, Sep 19, 2016 at 10:53:48AM +0200, Michal Hocko wrote:
> > On Mon 19-09-16 09:42:37, Lorenzo Stoakes wrote:
> > > On Mon, Sep 19, 2016 at 10:32:15AM +0200, Michal Hocko wrote:
> > > >
> > > > so this is the same thing as in Linus case. All the zones are hitting
> > > > min wmark so the should_compact_retry() gave up. As mentioned in other
> > > > email [1] this is inherent limitation of the workaround. Your system is
> > > > swapless but there is a lot of the reclaimable page cache so Vlastimil's
> > > > patches should help.
> > >
> > > I will experiment with a linux-next kernel and see if the problem
> > > recurs. I've attempted to see if there is a way to manually reproduce
> > > on the mainline kernel by performing workloads that triggered the
> > > OOM (loading google sheets tabs, compiling a kernel, playing a video
> > > on youtube), but to no avail - it seems the system needs to be
> > > sufficiently fragmented first before it'll trigger.
> > >
> > > Given that's the case, I'll just have to try using the linux-next
> > > kernel and if you don't hear from me you can assume it did not repro
> > > again :)
> >
> > OK, fair deal ;)
> 
> Actually, I'll break the deal :) I've been running workloads similar to previous
> weeks when I encountered the issue - including kernel builds, video playing,
> lotsa tabs, etc. and also tried to intentionally eat up a bit of RAM from
> time-to-time and have not seen a single OOM, so it looks like this is sorted it
> for my system, notwithstanding Murphy's law.

Thanks for the feedback. Your testing is highly appreciated! I guess
Andrew can put your Tested-by for the latest Vlastimil patches to credit
your effort.

-- 
Michal Hocko
SUSE Labs


* Re: More OOM problems (sorry fro the mail bomb)
  2016-09-21  7:04 ` Raymond Jennings
  2016-09-21  7:29   ` Michal Hocko
@ 2016-09-29  6:12   ` Raymond Jennings
  2016-09-29  7:03     ` Vlastimil Babka
  1 sibling, 1 reply; 36+ messages in thread
From: Raymond Jennings @ 2016-09-29  6:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Vlastimil Babka, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Wed, 21 Sep 2016 00:04:58 -0700
Raymond Jennings <shentino@gmail.com> wrote:

I would like to apologize to everyone for the mailbombing.  Something
went screwy with my email client and I had to bitchslap my installation
when I saw my gmail box full of half-composed messages being sent out.

For the curious, by the by, how does kcompactd work?  Does it just get
run on request or is it a continuous background process akin to
khugepaged?  Is there a way to keep it running in the background
defragmenting on a continuous trickle basis?


* Re: More OOM problems (sorry fro the mail bomb)
  2016-09-29  6:12   ` More OOM problems (sorry fro the mail bomb) Raymond Jennings
@ 2016-09-29  7:03     ` Vlastimil Babka
  2016-09-29 20:08       ` Raymond Jennings
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2016-09-29  7:03 UTC (permalink / raw)
  To: Raymond Jennings, Linus Torvalds
  Cc: Michal Hocko, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Ralf-Peter Rohbeck, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

On 09/29/2016 08:12 AM, Raymond Jennings wrote:
> On Wed, 21 Sep 2016 00:04:58 -0700
> Raymond Jennings <shentino@gmail.com> wrote:
>
> I would like to apologize to everyone for the mailbombing.  Something
> went screwy with my email client and I had to bitchslap my installation
> when I saw my gmail box full of half-composed messages being sent out.

FWIW, I apparently didn't receive any.

> For the curious, by the by, how does kcompactd work?  Does it just get
> run on request or is it a continuous background process akin to
> khugepaged?  Is there a way to keep it running in the background
> defragmenting on a continuous trickle basis?

Right now it gets run on request. Kswapd is woken up when watermarks
get between "min" and "low"; when it finishes reclaim and it was a
high-order request, it wakes up kcompactd, which compacts until a page
of the given order is available. That mimics how it was before, when
kswapd did the compaction itself, but I know it's not ideal and I plan
to make kcompactd more proactive.
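A sketch of that handoff (illustrative only; the dict fields are
invented simplifications of the real zone state):

```python
# kswapd is woken when a zone falls below its "low" watermark; it
# reclaims up to "high", and if the triggering request was high-order
# it then wakes kcompactd to compact until that order is available.

def kswapd_cycle(zone, request_order):
    """Model one kswapd wakeup; returns the events it triggers."""
    events = []
    if zone["free"] < zone["low"]:
        events.append("kswapd_reclaim")
        zone["free"] = zone["high"]          # reclaim until the "high" watermark
        if request_order > 0:
            events.append("wake_kcompactd")  # compact for the requested order
    return events
```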


* Re: More OOM problems (sorry fro the mail bomb)
  2016-09-29  7:03     ` Vlastimil Babka
@ 2016-09-29 20:08       ` Raymond Jennings
  2016-09-29 21:20         ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Raymond Jennings @ 2016-09-29 20:08 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Thu, Sep 29, 2016 at 12:03 AM, Vlastimil Babka <vbabka@suse.cz> 
wrote:
> On 09/29/2016 08:12 AM, Raymond Jennings wrote:
>> On Wed, 21 Sep 2016 00:04:58 -0700
>> Raymond Jennings <shentino@gmail.com> wrote:
>> 
>> I would like to apologize to everyone for the mailbombing.  Something
>> went screwy with my email client and I had to bitchslap my 
>> installation
>> when I saw my gmail box full of half-composed messages being sent 
>> out.
> 
> FWIW, I apparently didn't receive any.

Trying geary this time, keeping my fingers crossed

>> For the curious, by the by, how does kcompactd work?  Does it just 
>> get
>> run on request or is it a continuous background process akin to
>> khugepaged?  Is there a way to keep it running in the background
>> defragmenting on a continuous trickle basis?
> 
> Right now it gets run on request. Kswapd is woken up when watermarks 
> get between "min" and "low" and when it finishes reclaim and it was a 
> high-order request, it wakes up kcompactd, which compacts until page 
> of given order is available. That mimics how it was before when 
> kswapd did the compaction itself, but I know it's not ideal and plan 
> to make kcompactd more proactive.

Suggestion:

1.  Make it a background process "kcompactd"
2.  It is activated/woken up/semaphored awake any time a page is freed.
3.  Once it is activated, it enters a loop:
3.1.  Reset the semaphore.
3.2.  Once a cycle, it takes the highest movable page
3.3.  It then finds the lowest free page
3.4.  Then, it migrates the highest used page to the lowest free space
3.5.  maybe pace itself by sleeping for a teensy, then go back to step 3.2
3.6.  Do one page at a time to keep it neatly interruptible and keep it from blocking other stuff.  Since compaction is a housekeeping task, it should probably be eager to yield to other things.
3.7.  Probably leave hugepages alone if detected since they are by definition fairly defragmented already.
4.  Once all gaps are backfilled, go back to sleep and park back at step 2 waiting for the next wakeup.

Would this be a good way to do it?




* Re: More OOM problems (sorry fro the mail bomb)
  2016-09-29 20:08       ` Raymond Jennings
@ 2016-09-29 21:20         ` Vlastimil Babka
  2016-09-30 19:48           ` Raymond Jennings
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2016-09-29 21:20 UTC (permalink / raw)
  To: Raymond Jennings
  Cc: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On 09/29/2016 10:08 PM, Raymond Jennings wrote:
> Suggestion:
> 
> 1.  Make it a background process "kcompactd"
> 2.  It is activated/woke up/semaphored awake any time a page is freed.
> 3.  Once it is activated, it enters a loop:
> 3.1.  Reset the semaphore.
> 3.2.  Once a cycle, it takes the highest movable page
> 3.3.  It then finds the lowest free page
> 3.4.  Then, it migrates the highest used page to the lowest free space
> 3.5.  maybe pace itself by sleeping for a teensy, then go back to step 
> 3.2
> 3.6.  Do one page at a time to keep it neatly interruptible and keep it 
> from blocking other stuff.  Since compaction is a housekeeping task, it 
> should probably be eager to yield to other things.
> 3.7.  Probably leave hugepages alone if detected since they are by 
> definition fairly defragmented already.
> 4.  Once all gaps are backfilled, go back to sleep and park back at 
> step 2 waiting for the next wakeup.
> 
> Would this be a good way to do it?

Yes, that's pretty much how it already works, except movable pages are
taken from the low-pfn end and free pages from the high end. Then
there's a ton of subtle issues to tackle, mostly the balance between
overhead and benefit.
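A minimal simulation of the two-scanner scheme Vlastimil describes
(toy model only; real compaction works on pageblocks, migratetypes,
and much more):

```python
# A migrate scanner walks up from the low end collecting used movable
# pages; a free scanner walks down from the high end collecting free
# pages; used pages are migrated into the high free slots until the
# scanners meet, leaving free memory contiguous at the low end.

def compact(zone):
    """zone: list where 'M' = used movable page, '.' = free page."""
    migrate = 0                  # scans upward for used movable pages
    free = len(zone) - 1         # scans downward for free pages
    while migrate < free:
        if zone[migrate] != "M":
            migrate += 1
        elif zone[free] != ".":
            free -= 1
        else:
            # migrate the low used page into the high free slot
            zone[free], zone[migrate] = "M", "."
    return zone
```

For example, `compact(list("M.M.M"))` ends up as `['.', '.', 'M', 'M', 'M']`:
the fragmented free pages are gathered into one contiguous run.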


* Re: More OOM problems (sorry fro the mail bomb)
  2016-09-29 21:20         ` Vlastimil Babka
@ 2016-09-30 19:48           ` Raymond Jennings
  0 siblings, 0 replies; 36+ messages in thread
From: Raymond Jennings @ 2016-09-30 19:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Michal Hocko, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Ralf-Peter Rohbeck, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Thu, Sep 29, 2016 at 2:20 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 09/29/2016 10:08 PM, Raymond Jennings wrote:
>>  Suggestion:
>> 
>>  1.  Make it a background process "kcompactd"
>>  2.  It is activated/woke up/semaphored awake any time a page is 
>> freed.
>>  3.  Once it is activated, it enters a loop:
>>  3.1.  Reset the semaphore.
>>  3.2.  Once a cycle, it takes the highest movable page
>>  3.3.  It then finds the lowest free page
>>  3.4.  Then, it migrates the highest used page to the lowest free 
>> space
>>  3.5.  maybe pace itself by sleeping for a teensy, then go back to 
>> step
>>  3.2
>>  3.6.  Do one page at a time to keep it neatly interruptible and 
>> keep it
>>  from blocking other stuff.  Since compaction is a housekeeping 
>> task, it
>>  should probably be eager to yield to other things.
>>  3.7.  Probably leave hugepages alone if detected since they are by
>>  definition fairly defragmented already.
>>  4.  Once all gaps are backfilled, go back to sleep and park back at
>>  step 2 waiting for the next wakeup.
>> 
>>  Would this be a good way to do it?
> 
> Yes, that's pretty much how it already works, except movable pages are
> taken from low pfn and free pages from high. Then there's ton of 
> subtle
> issues to tackle, mostly the balance between overhead and benefit.

Besides the kswapd hook, what would nudge kcompactd to run? If it's
not proactively nudged after a page is freed, how will it know that
there's fragmentation that could be taken care of in advance, before
being shoved by kswapd?


* Re: More OOM problems
  2016-10-31 21:41           ` Vlastimil Babka
@ 2016-10-31 21:51             ` Vlastimil Babka
  0 siblings, 0 replies; 36+ messages in thread
From: Vlastimil Babka @ 2016-10-31 21:51 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Michal Hocko, Ralf-Peter Rohbeck, Linus Torvalds, Tetsuo Handa,
	Oleg Nesterov, Vladimir Davydov, Andrew Morton,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On 10/31/2016 10:41 PM, Vlastimil Babka wrote:
> In any case, it's still bad for 4.8 then.
> Can you send /proc/vmstat from the system with an uptime that already
> experienced at least one such oom?

Oh, and it might make sense to try the patch at the end of this e-mail:

https://marc.info/?l=linux-mm&m=147423605024993


* Re: More OOM problems
  2016-10-30  4:17         ` Simon Kirby
@ 2016-10-31 21:41           ` Vlastimil Babka
  2016-10-31 21:51             ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2016-10-31 21:41 UTC (permalink / raw)
  To: Simon Kirby
  Cc: Michal Hocko, Ralf-Peter Rohbeck, Linus Torvalds, Tetsuo Handa,
	Oleg Nesterov, Vladimir Davydov, Andrew Morton,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On 10/30/2016 05:17 AM, Simon Kirby wrote:
> On Tue, Oct 11, 2016 at 09:10:13AM +0200, Vlastimil Babka wrote:
> 
>> Great indeed. Note that meanwhile the patches went to mainline so
>> we'd definitely welcome testing from the rest of you who had
>> originally problems with 4.7/4.8 and didn't try the linux-next
>> recently. So a good point would be to test 4.9-rc1 when it's
>> released. I hope you don't want to discover regressions again too
>> late, in the 4.9 final release :)
> 
> Hello!
> 
> I have a mixed-purpose HTPCish box running MythTV, etc. that I recently
> upgraded from 4.6.7 to 4.8.4. This upgrade started OOM killing of various
> processes even when there is plenty (gigabytes) of memory as page cache.

Hmm, that's too bad.

> This is with CONFIG_COMPACTION=y, and it occurs with or without swap on.
> I'm not able to confirm on 4.9-rc2 since nouveau doesn't support NV117
> and binary blob nvidia doesn't yet like the changes to get_user_pages.

Please try once it starts liking the changes.
Actually, this kernel-interface part of the driver isn't a binary blob
AFAIK, so it should be possible to adapt it?

> 4.8 includes "prevent premature OOM killer invocation for high order
> request" which sounds like it should fix the issue, but this certainly
> does not seem to be the case for me. I copied kern.log and .config here:
> http://0x.ca/sim/ref/4.8.4/

Looks like the available high-order pages exist only as part of the
highatomic reserves. I've checked whether there might be some error in
the functions deciding to reclaim/compact, where they would wrongly
decide that these pages are available, but it seems fine to me.
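The point can be illustrated like this (pure model; the field names
are invented, not the kernel's):

```python
# Free high-order pages sitting in the highatomic reserve don't help a
# normal allocation, so an honest "could reclaim/compaction succeed?"
# check has to subtract the reserved amount before comparing.

def usable_free(zone):
    """Free memory a normal (non-highatomic) request can actually use."""
    return zone["free"] - zone["highatomic_reserved"]

def looks_allocatable(zone, needed):
    return usable_free(zone) >= needed
```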

> I see that this is reverted in 4.9-rc and replaced with something else.
> Unfortunately, I can't test this workload without the nvidia tainting,
> and "git log --oneline v4.8..v4.9-rc2 mm | grep oom | wc -l" returns 13.
> Is there some stuff I should cherry-pick to try?

Well, there were around 10 related patches, so I would rather try to
adapt the nvidia code first, if possible.

In any case, it's still bad for 4.8 then.
Can you send /proc/vmstat from the system with an uptime that already
experienced at least one such oom?

> Simon-
> 


* Re: More OOM problems
  2016-10-11  7:10       ` Vlastimil Babka
@ 2016-10-30  4:17         ` Simon Kirby
  2016-10-31 21:41           ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Simon Kirby @ 2016-10-30  4:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Ralf-Peter Rohbeck, Linus Torvalds, Tetsuo Handa,
	Oleg Nesterov, Vladimir Davydov, Andrew Morton,
	Markus Trippelsdorf, Arkadiusz Miskiewicz, Jiri Slaby,
	Olaf Hering, Joonsoo Kim, linux-mm

On Tue, Oct 11, 2016 at 09:10:13AM +0200, Vlastimil Babka wrote:

> Great indeed. Note that meanwhile the patches went to mainline so
> we'd definitely welcome testing from the rest of you who had
> originally problems with 4.7/4.8 and didn't try the linux-next
> recently. So a good point would be to test 4.9-rc1 when it's
> released. I hope you don't want to discover regressions again too
> late, in the 4.9 final release :)

Hello!

I have a mixed-purpose HTPCish box running MythTV, etc. that I recently
upgraded from 4.6.7 to 4.8.4. This upgrade started OOM killing of various
processes even when there is plenty (gigabytes) of memory as page cache.

This is with CONFIG_COMPACTION=y, and it occurs with or without swap on.
I'm not able to confirm on 4.9-rc2 since nouveau doesn't support NV117
and binary blob nvidia doesn't yet like the changes to get_user_pages.

4.8 includes "prevent premature OOM killer invocation for high order
request" which sounds like it should fix the issue, but this certainly
does not seem to be the case for me. I copied kern.log and .config here:
http://0x.ca/sim/ref/4.8.4/

I see that this is reverted in 4.9-rc and replaced with something else.
Unfortunately, I can't test this workload without the nvidia tainting,
and "git log --oneline v4.8..v4.9-rc2 mm | grep oom | wc -l" returns 13.
Is there some stuff I should cherry-pick to try?

Simon-


* Re: More OOM problems
  2016-10-11  6:44     ` More OOM problems Michal Hocko
@ 2016-10-11  7:10       ` Vlastimil Babka
  2016-10-30  4:17         ` Simon Kirby
  0 siblings, 1 reply; 36+ messages in thread
From: Vlastimil Babka @ 2016-10-11  7:10 UTC (permalink / raw)
  To: Michal Hocko, Ralf-Peter Rohbeck
  Cc: Linus Torvalds, Tetsuo Handa, Oleg Nesterov, Vladimir Davydov,
	Andrew Morton, Markus Trippelsdorf, Arkadiusz Miskiewicz,
	Jiri Slaby, Olaf Hering, Joonsoo Kim, linux-mm

On 10/11/2016 08:44 AM, Michal Hocko wrote:
> [Let's restore the CC list]
>
> On Mon 10-10-16 10:20:27, Ralf-Peter Rohbeck wrote:
>> I ran my torture test overnight (after finding the last linux-next branch
>> that compiled, sigh...):
>> Wrote two 4TB USB3 drives, compiled a kernel and ran my btrfs dedup script
>> in parallel.
>
> Thanks for testing and good to hear that premature OOMs are gone

Great indeed. Note that in the meantime the patches went to mainline, so we'd 
definitely welcome testing from the rest of you who had originally problems with 
4.7/4.8 and didn't try linux-next recently. So a good point would be to test 
4.9-rc1 when it's released. I hope you don't want to discover regressions again 
too late, in the 4.9 final release :)

Vlastimil


* Re: More OOM problems
       [not found]   ` <982671bd-5733-0cd5-c15d-112648ff14c5@Quantum.com>
@ 2016-10-11  6:44     ` Michal Hocko
  2016-10-11  7:10       ` Vlastimil Babka
  0 siblings, 1 reply; 36+ messages in thread
From: Michal Hocko @ 2016-10-11  6:44 UTC (permalink / raw)
  To: Ralf-Peter Rohbeck
  Cc: Vlastimil Babka, Linus Torvalds, Tetsuo Handa, Oleg Nesterov,
	Vladimir Davydov, Andrew Morton, Markus Trippelsdorf,
	Arkadiusz Miskiewicz, Jiri Slaby, Olaf Hering, Joonsoo Kim,
	linux-mm

[Let's restore the CC list]

On Mon 10-10-16 10:20:27, Ralf-Peter Rohbeck wrote:
> I ran my torture test overnight (after finding the last linux-next branch
> that compiled, sigh...):
> Wrote two 4TB USB3 drives, compiled a kernel and ran my btrfs dedup script
> in parallel.

Thanks for testing and good to hear that premature OOMs are gone

> There were a few allocation failures but I didn't notice anything amiss but
> the log entries.
> Logs are at
> https://filebin.net/duj4c1bv64uohm5q/OOM_4.8.0-rc7-next-20160920.tar.bz2.

Oct 10 03:35:18 fs kernel: kworker/1:202: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:214: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:236: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:236: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:224: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:224: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:172: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:227: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:226: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 03:35:18 fs kernel: kworker/1:229: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 06:45:54 fs kernel: kworker/3:91: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)
Oct 10 06:45:54 fs kernel: kworker/3:91: page allocation failure: order:0, mode:0x2204000(GFP_NOWAIT|__GFP_COMP|__GFP_NOTRACK)

So those are all atomic (aka non-sleeping) 4K allocations failing
because you are running low on memory, and this kind of allocation
request cannot reclaim any memory.
: Oct 10 03:35:18 fs kernel: Node 0 active_anon:28004kB inactive_anon:532404kB active_file:5665056kB inactive_file:1290052kB unevictable:64kB isolated(anon):0kB isolated(file):128kB mapped:46196kB dirty:686200kB writeback:124196kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 17920kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
: Oct 10 03:35:18 fs kernel: Node 0 DMA free:14236kB min:128kB low:160kB high:192kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15980kB managed:15896kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:1660kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
: Oct 10 03:35:18 fs kernel: lowmem_reserve[]: 0 1939 7939 7939 7939
: Oct 10 03:35:18 fs kernel: Node 0 DMA32 free:40476kB min:16480kB low:20600kB high:24720kB active_anon:6472kB inactive_anon:14408kB active_file:1073784kB inactive_file:740536kB unevictable:0kB writepending:470432kB present:2072256kB managed:2006688kB mlocked:0kB slab_reclaimable:60376kB slab_unreclaimable:32844kB kernel_stack:8352kB pagetables:1984kB bounce:0kB free_pcp:164kB local_pcp:0kB free_cma:0kB
: Oct 10 03:35:18 fs kernel: lowmem_reserve[]: 0 0 5999 5999 5999

These two zones are above the min watermark, but still under it if we
consider the lowmem reserves.

: Oct 10 03:35:18 fs kernel: Node 0 Normal free:50928kB min:50968kB low:63708kB high:76448kB active_anon:21532kB inactive_anon:517996kB active_file:4591272kB inactive_file:549636kB unevictable:64kB writepending:339940kB present:6291456kB managed:6147908kB mlocked:64kB slab_reclaimable:105320kB slab_unreclaimable:146140kB kernel_stack:17664kB pagetables:43872kB bounce:0kB free_pcp:340kB local_pcp:0kB free_cma:0kB
: Oct 10 03:35:18 fs kernel: lowmem_reserve[]: 0 0 0 0 0

and this zone is below the min watermark. I haven't checked the other
allocation failures, but I assume a similar situation. It looks like
you have a peak memory-pressure load and kswapd just cannot catch up
with it for a moment. Note that most of those failures come within a
second. You can ignore these warnings.

I will just note that all those failures come from bcache.
-- 
Michal Hocko
SUSE Labs



Thread overview: 36+ messages
2016-09-18 20:03 More OOM problems Linus Torvalds
2016-09-18 20:26 ` Lorenzo Stoakes
2016-09-18 20:58   ` Linus Torvalds
2016-09-18 21:13     ` Vlastimil Babka
2016-09-18 21:34       ` Lorenzo Stoakes
2016-09-19  8:32   ` Michal Hocko
2016-09-19  8:42     ` Lorenzo Stoakes
2016-09-19  8:53       ` Michal Hocko
2016-09-25 21:48         ` Lorenzo Stoakes
2016-09-26  7:48           ` Michal Hocko
2016-09-18 21:00 ` Vlastimil Babka
2016-09-18 21:18   ` Linus Torvalds
2016-09-19  6:27     ` Jiri Slaby
2016-09-19  7:01     ` Michal Hocko
2016-09-19  7:52       ` Michal Hocko
2016-09-19  1:07   ` Andi Kleen
     [not found]     ` <alpine.DEB.2.20.1609190836540.12121@east.gentwo.org>
2016-09-19 14:31       ` Andi Kleen
2016-09-19 14:39         ` Michal Hocko
2016-09-19 14:41         ` Vlastimil Babka
2016-09-19 18:18           ` Linus Torvalds
2016-09-19 19:57           ` Christoph Lameter
2016-09-18 22:00 ` Vlastimil Babka
2016-09-19  6:56   ` Michal Hocko
2016-09-19  6:48 ` Michal Hocko
2016-09-21  7:04 ` Raymond Jennings
2016-09-21  7:29   ` Michal Hocko
2016-09-29  6:12   ` More OOM problems (sorry fro the mail bomb) Raymond Jennings
2016-09-29  7:03     ` Vlastimil Babka
2016-09-29 20:08       ` Raymond Jennings
2016-09-29 21:20         ` Vlastimil Babka
2016-09-30 19:48           ` Raymond Jennings
     [not found] <eafb59b5-0a2b-0e28-ca79-f044470a2851@Quantum.com>
     [not found] ` <20160930214448.GB28379@dhcp22.suse.cz>
     [not found]   ` <982671bd-5733-0cd5-c15d-112648ff14c5@Quantum.com>
2016-10-11  6:44     ` More OOM problems Michal Hocko
2016-10-11  7:10       ` Vlastimil Babka
2016-10-30  4:17         ` Simon Kirby
2016-10-31 21:41           ` Vlastimil Babka
2016-10-31 21:51             ` Vlastimil Babka
