* post linux 4.4 vm oom kill, lockup and thrashing woes
@ 2018-07-10 12:07 Marc Lehmann
  2018-07-10 12:32 ` Michal Hocko
From: Marc Lehmann @ 2018-07-10 12:07 UTC (permalink / raw)
  To: linux-mm; +Cc: Michal Hocko

(I am not subscribed)

Hi!

While reporting another (not strictly related) kernel bug
(https://bugzilla.kernel.org/show_bug.cgi?id=199931) I was encouraged to
report my problem here, even though, in my opinion, I don't have enough
hard data for a good bug report, so bear with me, please.

Basically, the post 4.4 VM system (I think my troubles started around 4.6
or 4.7) is nearly unusable on all of my (very different) systems that
actually do some work, with symptoms being frequent OOM kills with many
gigabytes of available memory, extended periods of semi-freezing with
thrashing, and apparent hard lockups, almost certainly related to memory
usage.

I have experienced this with both Debian's and Ubuntu's precompiled kernels
(4.9 being the most unstable for me) as well as with my own builds. Booting
4.4 makes the problems go away in all cases.

Since I kept losing my logs due to the other kernel bug (triggered by my
workaround), I don't have a lot of good logs, so this is mostly anecdotal.
I hope it is of some use anyway, especially since I found a workaround for
each case that reduces or alleviates the problem and so might shed some
light on the underlying issue(s).

I present three "case studies" of how I can create/trigger these problems
on three very different systems: a server, a desktop, and a very old
memory-starved laptop, all of which become close to unusable for daily
work under post-4.4 kernels.

=============================================================================
Case #1, the home server, frequent unexpected oom kills
=============================================================================

The first system is a server which does a lot of heavy lifting with a lot
of data (>60TB of disk, a lot of activity). It has 32GB of RAM, and almost
never uses more than 8GB of it, the rest usually being disk cache, e.g.:

		  total        used        free      shared  buff/cache   available
    Mem:       32888772     1461060      813500       13740    30614212    30956016
    Swap:       4194300       54016     4140284

Under 4.4, it runs "mostly" rock stable. With Debian's 4.9, mysql usually
gets killed within a single night. 4.14 is much better, but when doing
backups or other memory-intensive jobs, mysql usually gets killed anyway.
Many times. Usually with >>16GB of "available" memory that linux could use
instead, if it weren't so fragmented or if it could free some of it.

Here are some OOM reports, all of which happened during my nightly backup,
under 4.14.33:

    http://data.plan9.de/oom-mysql-4.14-201806.txt

This specific OOM kill series happened during backup, which mainly does a
lot of stat() calls (as in, a hundred million+), but while this helps
trigger OOM kills, it is by no means the required trigger.

I lost all of the previous OOM kill reports, but AFAICR they were
invariably caused by higher-order allocations, often by the nvidia driver,
which just loves higher-order allocations, but they do happen with other
subsystems (such as btrfs) too, and were often triggered by measly order-1
allocations as well.

I have tried various workarounds, and under 4.14 I found that doing this
every hour or so greatly reduces the OOM kills (and unfortunately also
causes file corruption, but that's unrelated :):

    echo 1 > /proc/sys/vm/drop_caches

I have tried various other things that didn't work: "echo 1
>/proc/sys/vm/compact_memory" every minute, increasing min_free_kbytes,
setting swappiness to 1 or 100, setting vfs_cache_pressure to 50 or 150,
and reducing extfrag_threshold.

Clearly, the server has enough memory, but linux has enormous trouble
making use of it under 4.6+ (or so), while it works fine under 4.4.
Naively speaking, linux should obviously drop some cache rather than kill
processes, although I am aware that things are not as simple as that,
especially when fragmentation is involved.

=============================================================================
Case #2, my work desktop, frequent unexpected oom kills, frequent lockups
=============================================================================

My work desktop (16GB RAM) also suffers from the same problems as my home
server, with chromium usually being the thing that gets killed first due
to its increased oom_score_adj value, which made me run chromium more
often as a sacrificial process. Clearly a bad thing.

However, under post-4.4 kernels, I also get frequent freezes, which seem
to be hard lockups (I did let it run for 5 to 15 minutes a few times, and
it didn't seem to recover - maybe it's thrashing to the SSD, but I can't
hear that :).

I found a pretty reliable way to get OOM kills or freezes (they happen
on their own as well, just not as reproducibly): mmap a large file. I
have written a simple nbd-based caching program that writes dirty write
data to a separate log file, to be applied later. While it lets me
reproduce the freezes, I don't know if it is the only cause, as I don't
run this cache program very often but still get a lockup every few days
regardless, depending on how heavily I use this machine.

This is a simple simulation of what the cache program does to cause the
problem:

    http://data.plan9.de/mmap-problem-testcase

What this does is create a large 35GB file, mmap it, and then read through
the mapped region, i.e. page it into memory.
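
In case that link goes away: here is a minimal C sketch of what the test
case amounts to (the path and size are taken from the smaps output below;
the rest is my reconstruction, not the original test case):

    /* create a ~35GB file, mmap it MAP_SHARED, fault every page in by
     * reading it, then sleep so /proc/<pid>/smaps can be inspected */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char   *path = "/cryptroot/test";  /* path as seen in smaps */
        const size_t  size = 35ULL << 30;        /* ~35GB */

        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, size) < 0) {
            perror("open/ftruncate");
            return 1;
        }

        char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        volatile char sum = 0;
        for (size_t off = 0; off < size; off += 4096) /* read-fault every page */
            sum += p[off];

        pause();                         /* stands in for the "sleep 9999" below */
        return 0;
    }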

Situation before:

		  total        used        free      shared  buff/cache   available
    Mem:       16426820     2455612     1872868       11200    12098340    13623920
    Swap:       8388604       26368     8362236

Situation after starting the test case, when it hangs in its "sleep 9999":

    7ff72e8e2000-7fffee8e2000 rw-s 00000000 00:17 3746909                    /cryptroot/test
    Size:           36700160 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Rss:             7886400 kB
    Pss:             7886400 kB
    Shared_Clean:          0 kB
    Shared_Dirty:          0 kB
    Private_Clean:   7886400 kB
    Private_Dirty:         0 kB
    Referenced:      7886400 kB
    Anonymous:             0 kB
    LazyFree:              0 kB
    AnonHugePages:         0 kB
    ShmemPmdMapped:        0 kB
    Shared_Hugetlb:        0 kB
    Private_Hugetlb:       0 kB
    Swap:                  0 kB
    SwapPss:               0 kB
    Locked:          7886400 kB
    VmFlags: rd wr sh mr mw me ms sd

		  total        used        free      shared  buff/cache   available
    Mem:       16426820     2391784     5845592        7888     8189444    13734508
    Swap:       8388604       26368     8362236

So, not much changed here, one would think - just a bunch of clean pages
that could be freed when memory is needed. Maybe it's noteworthy that I
have 8GB buff/cache despite issuing a drop_caches, most of which I suspect
is the non-dirty mmap area.

However, starting kvm with an 8GB memory size in this situation instantly
freezes my box, when it should just work:

   kvm -m 8000 ...

This is unexpected, with 13GB of "available" memory.

(Don't get confused by the Locked: value - since commit
493b0e9d945fa9dfe96be93ae41b4ca4b6fdb317, linux always reports Locked
== Pss. I've emailed dancol@google.com about this but never got a
response. There is no mlocking involved, and this confused the heck out
of me for a while.)

There is an easy way to avoid the freeze: unmap the file and immediately
mmap it again, which makes all those Private_Clean pages go away and makes
my actual caching program usable, since it only has to scan through the
file once during startup and afterwards only touches random pages within
it.
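
In code, the workaround is essentially this (a sketch, not my caching
program's actual source):

    /* replace the mapping to get rid of the faulted-in Private_Clean pages;
     * callers should check the new mapping for MAP_FAILED */
    #include <stddef.h>
    #include <sys/mman.h>

    static char *remap_file(char *map, int fd, size_t size)
    {
        munmap(map, size);
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }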

So, linux 4.14 has trouble freeing these pages, even though they are not
dirty, and instead effectively freezes.

This happens with the mmapped file both on XFS-on-lvm and on
BTRFS-on-dmcrypt-on-dmcache-on-lvm, so it doesn't seem to be a
filesystem-specific issue.

Another workaround is to run a series of smaller but increasingly large
processes, e.g.:

   perl -e '1 x 1_000_000_000'
   perl -e '1 x 2_000_000_000'
   perl -e '1 x 4_000_000_000'
   perl -e '1 x 6_000_000_000'
   perl -e '1 x 8_000_000_000'

This manages to recover the "lost" memory somewhat, after which I am able
to start my 8GB vm without causing a freeze:

    7ff72e8e2000-7fffee8e2000 rw-s 00000000 00:17 3746909                    /cryptroot/test
    Size:           36700160 kB
    Rss:             5583036 kB
    Pss:             5583036 kB
    Private_Clean:   5583036 kB
    Referenced:      5583036 kB

The Pss size also decreases slowly over time during normal activity, so
it's clearly not locked by the kernel. The kernel merely prefers to freeze
rather than free it quickly :)

=============================================================================
Case #3, the 10 year old laptop, thrashing semi freezes
=============================================================================

The last case I present is the laptop next to my bed, with 2GB of RAM and
8GB of swap. It's used for image/movie viewing, e-book reading and
firefoxing. The root filesystem is sometimes on a 4GB USB stick and
sometimes on a 16GB SD card, and it has a somewhat broken 32GB SSD used
exclusively for swap and a dmcache. I know it's weird, but it works.

It's not doing any heavy work, but it uses a lot more memory than it has
RAM for. It's quite amazing: under 4.4, despite constantly using 2+GB
of swap (typically 3.5GB of swap is in use), it works _very well_ indeed,
with only occasional split-second pauses due to swapping when switching
desktops, for example.

Under 4.14, it freezes for 5-10 minutes every few minutes, but always
recovers. The mouse pointer moves a bit every minute or so when I am
lucky. So it's not fun to use when all you wanted to do is flip pages in
fbreader and suddenly have to pause for 10 minutes. And no, I am not
exaggerating - I timed it a few times, and it really hangs for this long
every few minutes.

While it freezes, there is heavy disk activity. Looking at dstat output
afterwards, it is clear that there is little to no write activity, and
that all read activity goes to the root filesystem, not swap. Swap hardly
gets used at all under 4.14 on this box.

From the little data I have, I would guess that linux runs out of
memory and then throws away code pages, just to immediately read them in
again. This would explain the heavy read-only disk activity and also why
the box is more or less frozen during these episodes: it's in classic full
thrashing mode.

No amount of tinkering with /proc/sys/vm seems to make a difference (I
would have hoped that setting swappiness to 100 would help, but nope), but
I did find a workaround that almost completely fixes the problem... wait
for it...

   while sleep 10; do perl -e '1 x 300_000_000';done

i.e., create a dummy 300MB process every 10 seconds. I have no clue why
this works, but it changes the behaviour drastically:

    1. swap gets used, not as aggressively as under 4.4, but it does get used
    2. the box thrash-freezes much less often
    3. if it freezes, it usually recovers after 1-2 minutes, and the mouse
       pointer sometimes moves as well during this time. yay.

It is also very similar to my workaround on my desktop box, although the
mix of programs I run is very different and the memory situation is very
different. Still, I feel linux on my other boxes is just as reluctant to
use swap, and rather OOM kills or freezes instead.

4.4 on the same box with exactly the same root filesystem has none of
these problems; it simply swaps out stuff when memory gets tight.

=============================================================================
Summary
=============================================================================

So, while this is mostly anecdotal, I think there is a real issue with
post-4.4 kernels. Given the wide range of configurations on which I run
into memory issues, I don't think this is an isolated hardware or config
issue; I can reproduce some of these problems with a debian boot CD as
well, so it's not anything in my config.

I found that around 4.8-4.9 the behaviour was worst - 4.9 causes trouble
on most of my boxes, not just these three - while 4.14 is greatly improved
and works fine on a lot of my much more idle servers.

I hope this is somewhat useful in finding this issue. Thanks for staying
with me and reading this :)

If requested, I can try to produce more info and do more experimenting,
although maybe not in a very timely manner.

Greetings,

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

* Re: post linux 4.4 vm oom kill, lockup and thrashing woes
  2018-07-10 12:07 post linux 4.4 vm oom kill, lockup and thrashing woes Marc Lehmann
@ 2018-07-10 12:32 ` Michal Hocko
  2018-07-17 23:45   ` Marc Lehmann
From: Michal Hocko @ 2018-07-10 12:32 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-mm

On Tue 10-07-18 14:07:56, Marc Lehmann wrote:
> (I am not subscribed)
> 
> Hi!
> 
> While reporting another (not strictly related) kernel bug
> (https://bugzilla.kernel.org/show_bug.cgi?id=199931) I was encouraged to
> report my problem here, even though, in my opinion, I don't have enough
> hard data for a good bug report, so bear with me, please.
> 
> Basically, the post 4.4 VM system (I think my troubles started around 4.6
> or 4.7) is nearly unusable on all of my (very different) systems that
> actually do some work, with symptoms being frequent OOM kills with many
> gigabytes of available memory, extended periods of semi-freezing with
> thrashing, and apparent hard lockups, almost certainly related to memory
> usage.

JFTR, we have discussed this off-list and Marc has provided an example
oom report:
[48190.574505] nvidia-modeset invoked oom-killer: gfp_mask=0x14040c0(GFP_KERNEL|__GFP_COMP), nodemask=(null),  order=3, oom_score_adj=0
[48190.574508] nvidia-modeset cpuset=/ mems_allowed=0
[...]
[48190.574769] active_anon:960260 inactive_anon:175381 isolated_anon:0
                active_file:1061865 inactive_file:177006 isolated_file:0
                unevictable:0 dirty:273 writeback:0 unstable:0
                slab_reclaimable:1519864 slab_unreclaimable:61079
                mapped:31182 shmem:11064 pagetables:23135 bounce:0
                free:53178 free_pcp:68 free_cma:0
[...]
[48190.574783] Node 0 DMA: 0*4kB 2*8kB (U) 3*16kB (U) 2*32kB (U) 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15872kB
[48190.574787] Node 0 DMA32: 2015*4kB (UME) 4517*8kB (UME) 5301*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 129012kB
[48190.574791] Node 0 Normal: 6379*4kB (UME) 2915*8kB (UE) 1266*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 69092kB

We are out of order-3+ blocks in all eligible zones (please note that
the DMA zone is not really usable for this request). Different kernel
versions have slightly different implementations of compaction so they
might behave differently but once it cannot make any progress
then we are out of luck. It is quite unfortunate that nvidia really
insists on having order-3 allocation. Maybe it can use kvmalloc or use
__GFP_RETRY_MAYFAIL in current kernels.
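
Roughly (my sketch, not the driver's actual code), the two options for an
order-3 sized buffer would look like this:

    #include <linux/mm.h>    /* kvmalloc(), kvfree() */
    #include <linux/slab.h>  /* kmalloc(), GFP flags */

    static void *alloc_order3_buffer(void)
    {
        /* option 1: transparently fall back to vmalloc() when no contiguous
         * order-3 block can be found; only valid if the buffer does not have
         * to be physically contiguous. Free with kvfree(). */
        return kvmalloc(8 * PAGE_SIZE, GFP_KERNEL);

        /* option 2: keep requiring contiguous memory but let the allocation
         * fail instead of invoking the oom killer:
         *
         *     kmalloc(8 * PAGE_SIZE, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
         */
    }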

It is quite surprising that we have so much memory yet are not able to
find an order-3 contiguous block. This smells suspicious. You have
previously mentioned that dropping caches helped, so I assume that fs
metadata are fragmenting the memory.

Anyway, I will go over your whole report later. I am quite busy right now.

Thanks for the report!
-- 
Michal Hocko
SUSE Labs

* Re: post linux 4.4 vm oom kill, lockup and thrashing woes
  2018-07-10 12:32 ` Michal Hocko
@ 2018-07-17 23:45   ` Marc Lehmann
  2018-07-18  8:38     ` Michal Hocko
From: Marc Lehmann @ 2018-07-17 23:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

On Tue, Jul 10, 2018 at 02:32:22PM +0200, Michal Hocko <mhocko@kernel.org> wrote:
> then we are out of luck. It is quite unfortunate that nvidia really
> insists on having order-3 allocation. Maybe it can use kvmalloc or use
> __GFP_RETRY_MAYFAIL in current kernels.

Please note that nvidia is really just one of many causes. For example,
right now, on one of our company servers with 25GB of available RAM and
no nvidia driver on linux 4.14.43, I couldn't start any kvm until I did a
manual cache flush:

   ~# vmctl start ...
   ioctl(KVM_CREATE_VM) failed: 12 Cannot allocate memory
   failed to initialize KVM: Cannot allocate memory
   ~# free
                 total        used        free      shared  buff/cache   available
   Mem:       32619348     6712028      989540       21652    24917780    25430736
   Swap:      33554428      249676    33304752
   ~# sync; echo 3 >/proc/sys/vm/drop_caches
   ~# vmctl start ...
   [successful]

The reason was an order-6 allocation by kvm:

http://data.plan9.de/kvm_oom.txt

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

* Re: post linux 4.4 vm oom kill, lockup and thrashing woes
  2018-07-17 23:45   ` Marc Lehmann
@ 2018-07-18  8:38     ` Michal Hocko
  2018-07-22 23:34       ` Marc Lehmann
From: Michal Hocko @ 2018-07-18  8:38 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-mm

On Wed 18-07-18 01:45:49, Marc Lehmann wrote:
> On Tue, Jul 10, 2018 at 02:32:22PM +0200, Michal Hocko <mhocko@kernel.org> wrote:
> > then we are out of luck. It is quite unfortunate that nvidia really
> > insists on having order-3 allocation. Maybe it can use kvmalloc or use
> > __GFP_RETRY_MAYFAIL in current kernels.
> 
> Please note that nvidia is really just one of many causes. For example,
> right now, on one of our company servers with 25GB of available RAM and
> no nvidia driver on linux 4.14.43, I couldn't start any kvm until I did a
> manual cache flush:
> 
>    ~# vmctl start ...
>    ioctl(KVM_CREATE_VM) failed: 12 Cannot allocate memory
>    failed to initialize KVM: Cannot allocate memory
>    ~# free
>                  total        used        free      shared  buff/cache   available
>    Mem:       32619348     6712028      989540       21652    24917780    25430736
>    Swap:      33554428      249676    33304752
>    ~# sync; echo 3 >/proc/sys/vm/drop_caches
>    ~# vmctl start ...
>    [successful]
> 
> The reason was an order-6 allocation by kvm:
> 
> http://data.plan9.de/kvm_oom.txt

That is something to bring up with the kvm guys. Order-6 pages are
considered costly and success of the allocation is by no means
guaranteed. Unlike for orders smaller than 4, they do not trigger the oom
killer though.
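
For reference, the cutoff is PAGE_ALLOC_COSTLY_ORDER in
include/linux/mmzone.h:

    /* allocations above this order are considered costly and are allowed
     * to fail without invoking the oom killer */
    #define PAGE_ALLOC_COSTLY_ORDER 3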

If kvm doesn't really require the physically contiguous memory then
vmalloc fallback would be a good alternative. Unfortunately I am not
able to find which allocation that is. What does faddr2line kvm_dev_ioctl_create_vm+0x40
say?
-- 
Michal Hocko
SUSE Labs

* Re: post linux 4.4 vm oom kill, lockup and thrashing woes
  2018-07-18  8:38     ` Michal Hocko
@ 2018-07-22 23:34       ` Marc Lehmann
  2018-07-23 12:55         ` Michal Hocko
From: Marc Lehmann @ 2018-07-22 23:34 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

On Wed, Jul 18, 2018 at 10:38:08AM +0200, Michal Hocko <mhocko@kernel.org> wrote:
> > http://data.plan9.de/kvm_oom.txt
> 
> That is something to bring up with the kvm guys. Order-6 pages are
> considered costly and success of the allocation is by no means
> guaranteed. Unlike for orders smaller than 4, they do not trigger the oom
> killer though.

So 4 is the magic barrier, good to know. In any case, as I said, it's just
an example of various allocations that fail unexpectedly after 4.4, and it's
by no means just nvidia.

> vmalloc fallback would be a good alternative. Unfortunately I am not
> able to find which allocation that is. What does faddr2line kvm_dev_ioctl_create_vm+0x40
> say?

I suspect I can't run this for an installed kernel without sources/object
files? In this case a precompiled kernel from ubuntu mainline-ppa.
Running faddr2line kvm.ko ... just gives me:

   kvm_dev_ioctl_create_vm+0x40/0x5d1:
   kvm_dev_ioctl_create_vm at ??:?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

* Re: post linux 4.4 vm oom kill, lockup and thrashing woes
  2018-07-22 23:34       ` Marc Lehmann
@ 2018-07-23 12:55         ` Michal Hocko
  2018-07-31  3:45           ` Marc Lehmann
From: Michal Hocko @ 2018-07-23 12:55 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-mm

On Mon 23-07-18 01:34:37, Marc Lehmann wrote:
> On Wed, Jul 18, 2018 at 10:38:08AM +0200, Michal Hocko <mhocko@kernel.org> wrote:
> > > http://data.plan9.de/kvm_oom.txt
> > 
> > That is something to bring up with the kvm guys. Order-6 pages are
> > considered costly and success of the allocation is by no means
> > guaranteed. Unlike for orders smaller than 4, they do not trigger the oom
> > killer though.
> 
> So 4 is the magic barrier, good to know.

Yeah, scientifically proven. Or something along those lines.

> In any case, as I said, it's just
> an example of various allocations that fail unexpectedly after 4.4, and it's
> by no means just nvidia.

Large allocation failures shouldn't be directly related to the OOM
changes at the time. There were many compaction fixes/enhancements
introduced then and later which should help with those, though.

Having more examples should help us to work with specific subsystems
on a more appropriate fix. Depending on large order allocations has
always been suboptimal if not outright wrong.

> 
> > vmalloc fallback would be a good alternative. Unfortunately I am not
> > able to find which allocation that is. What does faddr2line kvm_dev_ioctl_create_vm+0x40
> > say?
> 
> I suspect I can't run this for an installed kernel without sources/object
> files? In this case a precompiled kernel from ubuntu mainline-ppa.
> Running faddr2line kvm.ko ... just gives me:
> 
>    kvm_dev_ioctl_create_vm+0x40/0x5d1:
>    kvm_dev_ioctl_create_vm at ??:?

You need a vmlinux with debuginfo compiled IIRC.
-- 
Michal Hocko
SUSE Labs

* Re: post linux 4.4 vm oom kill, lockup and thrashing woes
  2018-07-23 12:55         ` Michal Hocko
@ 2018-07-31  3:45           ` Marc Lehmann
  2018-07-31  7:28             ` Michal Hocko
From: Marc Lehmann @ 2018-07-31  3:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm

On Mon, Jul 23, 2018 at 02:55:54PM +0200, Michal Hocko <mhocko@kernel.org> wrote:
> 
> Having more examples should help us to work with specific subsystems
> on a more appropriate fix. Depending on large order allocations has
> always been suboptimal if not outright wrong.

I think this is going in the wrong direction. First of all, keep in mind
that I have to actively work against getting more examples, as I have to
keep things running and will employ more and more workarounds.

More importantly, however, it's all well and good if the kernel fails
high-order allocations when it has to, and it's all well to try to "fix"
them to not happen, but let's not forget the real problem, which is linux
thrashing, freezing or killing unrelated processes when it has no reason
to. Specifically, if I have 32GB of RAM and 30GB of page cache that isn't
locked, then linux has no conceivable reason not to satisfy even a
high-order allocation by moving some movable pages around.

I think the examples I provided should already give some insight. For
example, doing a large mmap and faulting the pages in should not cause
these pages to be so stubbornly locked as to cause the machine to freeze
on a large allocation, when it could "simply" drop a few gigabytes of
(non-dirty!) shared file pages instead.

It's possible that the post-4.4 vm changes are not the direct cause of
this, but only caused a hitherto unproblematic behaviour to become a
problem, e.g. (totally made up) mmapped file data was freed in 4.4 simply
because it tried harder, while post-4.4 kernels prefer to lock up instead.
Then the changes done post-4.4 are not the cause of the problem, but
simply the trigger, just as the higher-order allocations of some
subsystems are not the cause of the spurious oom kills, but simply the
trigger.

Or, to put it bluntly, no matter how badly written kvm and/or the nvidia
subsystems are, the kernel has no business killing mysql on my boxes when
95% of its memory is available. If this were by design, then linux should
have the ability to keep memory free for such uses (something like
min_free_kbytes) and not use memory for disk cache if this memory is then
lost to other applications.

And yes, if I see more "interesting" examples, I will of course tell you
about them :)

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

* Re: post linux 4.4 vm oom kill, lockup and thrashing woes
  2018-07-31  3:45           ` Marc Lehmann
@ 2018-07-31  7:28             ` Michal Hocko
From: Michal Hocko @ 2018-07-31  7:28 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-mm

On Tue 31-07-18 05:45:46, Marc Lehmann wrote:
> On Mon, Jul 23, 2018 at 02:55:54PM +0200, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > Having more examples should help us to work with specific subsystems
> > on a more appropriate fix. Depending on large order allocations has
> > always been suboptimal if not outright wrong.
> 
> I think this is going in the wrong direction. First of all, keep in mind
> that I have to actively work against getting more examples, as I have to
> keep things running and will employ more and more workarounds.
> 
> More importantly, however, it's all well and good if the kernel fails
> high-order allocations when it has to, and it's all well to try to "fix"
> them to not happen, but let's not forget the real problem, which is linux
> thrashing, freezing or killing unrelated processes when it has no reason
> to. Specifically, if I have 32GB of RAM and 30GB of page cache that isn't
> locked, then linux has no conceivable reason not to satisfy even a
> high-order allocation by moving some movable pages around.

This is what we are trying as hard as we can though.

> I think the examples I provided should already give some insight. For
> example, doing a large mmap and faulting the pages in should not cause
> these pages to be so stubbornly locked as to cause the machine to freeze
> on a large allocation, when it could "simply" drop a few gigabytes of
> (non-dirty!) shared file pages instead.

Yes, we try to reclaim clean page cache quite aggressively, and a failing
compaction is a reason to reclaim even more. But life is not that simple.
There might be different reasons why even clean page cache is not
migratable, e.g. when those pages are pinned by the filesystem.

> It's possible that the post-4.4 vm changes are not the direct cause of
> this, but only caused a hitherto unproblematic behaviour to become a
> problem, e.g. (totally made up) mmapped file data was freed in 4.4 simply
> because it tried harder, while post-4.4 kernels prefer to lock up instead.
> Then the changes done post-4.4 are not the cause of the problem, but
> simply the trigger, just as the higher-order allocations of some
> subsystems are not the cause of the spurious oom kills, but simply the
> trigger.

Well, this is really hard to tell from the data I have seen. All I can
tell right now is that the system is heavily fragmented and that there is
a hard demand for high-order requests which we simply do not fail, and
rather go and oom kill. Is this good? Absolutely not, but it is something
that is really hard to change. We have historical reasons why non-costly
(order smaller than 4) allocations basically never fail, and that is
really hard to change. The general recommendation is to simply not do
that because it hurts. Sucks, I know...

> Or, to put it bluntly, no matter how badly written kvm and/or the nvidia
> subsystems are, the kernel has no business killing mysql on my boxes when
> 95% of its memory is available. If this were by design, then linux should
> have the ability to keep memory free for such uses (something like
> min_free_kbytes) and not use memory for disk cache if this memory is then
> lost to other applications.

Yes, we have min_free_kbytes, but fragmentation sucks. You can try to
increase this value and it usually helps. But not unconditionally.

> And yes, if I see more "interesting" examples, I will of course tell you
> about them :)

It would be good to track down why compaction doesn't help. We have some
counters in /proc/vmstat, so collecting them over time might give us some
clue. There are also some tracepoints which might tell us more.
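
Something along these lines (a rough sketch, not an existing tool) left
running during the problematic workload would do:

    /* sample compaction/reclaim counters from /proc/vmstat once a minute so
     * their evolution can later be correlated with the stalls/oom kills */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        char line[128];

        for (;;) {
            FILE *f = fopen("/proc/vmstat", "r");

            if (!f)
                return 1;
            printf("--- %ld\n", (long)time(NULL));
            while (fgets(line, sizeof line, f))
                if (!strncmp(line, "compact_", 8) ||
                    !strncmp(line, "allocstall", 10) ||
                    !strncmp(line, "pgsteal_", 8))
                    fputs(line, stdout);
            fclose(f);
            fflush(stdout);
            sleep(60);
        }
    }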

In general, though, it is much preferable to reduce aggressive high-order
memory requests.
-- 
Michal Hocko
SUSE Labs
