* Memory cgroup invokes OOM killer when there are a lot of dirty pages
@ 2018-07-03 21:14 Petros Angelatos
From: Petros Angelatos @ 2018-07-03 21:14 UTC
  To: linux-mm, cgroups; +Cc: Michal Hocko, Hugh Dickins, Tejun Heo, lstoakes

Hello,

I'm facing a strange problem when I constrain an IO intensive
application that generates a lot of dirty pages inside a v1 cgroup
with a memory controller. After a while the OOM killer kicks in and
kills the processes instead of throttling the allocations while dirty
pages are being flushed. Here is a test program that reproduces the
issue:

  cd /sys/fs/cgroup/memory/
  mkdir dirty-test
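  # 10485760 bytes = 10MiB hard limit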
  echo 10485760 > dirty-test/memory.limit_in_bytes

  echo $$ > dirty-test/cgroup.procs

  rm /mnt/file_*
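  # dd defaults to 512-byte blocks, so count=2048 writes 1MiB per file (~500MiB total)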
  for i in $(seq 500); do
    dd if=/dev/urandom count=2048 of="/mnt/file_$i"
  done

When a process gets killed I get the following trace in dmesg:

> foo.sh invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
> foo.sh cpuset=/ mems_allowed=0
> CPU: 0 PID: 18415 Comm: foo.sh Tainted: P           O      4.17.2-1-ARCH #1
> Hardware name: LENOVO 20F9CTO1WW/20F9CTO1WW, BIOS N1CET52W (1.20 ) 11/30/2016
> Call Trace:
>  dump_stack+0x5c/0x80
>  dump_header+0x6b/0x2a1
>  ? preempt_count_add+0x68/0xa0
>  ? _raw_spin_trylock+0x13/0x50
>  oom_kill_process.cold.5+0xb/0x43b
>  out_of_memory+0x1a1/0x470
>  mem_cgroup_out_of_memory+0x49/0x80
>  mem_cgroup_oom_synchronize+0x329/0x360
>  ? __mem_cgroup_insert_exceeded+0x90/0x90
>  pagefault_out_of_memory+0x32/0x77
>  __do_page_fault+0x518/0x570
>  ? __se_sys_rt_sigaction+0x9f/0xd0
>  do_page_fault+0x32/0x130
>  ? page_fault+0x8/0x30
>  page_fault+0x1e/0x30
> RIP: 0033:0x56079824e134
> RSP: 002b:00007ffeac088fd0 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 00007ffeac088be0 RDI: 000056079824e720
> RBP: 00005607984ce5e0 R08: 00007ffeac088dd0 R09: 0000000000000001
> R10: 0000000000000008 R11: 0000000000000202 R12: 00005607984c9040
> R13: 000056079824e720 R14: 00005607984ce3c0 R15: 0000000000000000
> Task in /dirty-test killed as a result of limit of /dirty-test
> memory: usage 10240kB, limit 10240kB, failcnt 13073
> memory+swap: usage 10240kB, limit 9007199254740988kB, failcnt 0
> kmem: usage 1308kB, limit 9007199254740988kB, failcnt 0
> Memory cgroup stats for /dirty-test: cache:8848KB rss:180KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:8580KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:200KB inactive_file:4364KB active_file:4364KB unevictable:0KB
> [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> [18160]     0 18160     3468      652    73728        0             0 foo.sh
> [18415]     0 18415     3468      118    61440        0             0 foo.sh
> Memory cgroup out of memory: Kill process 18160 (foo.sh) score 261 or sacrifice child
> Killed process 18415 (foo.sh) total-vm:13872kB, anon-rss:472kB, file-rss:0kB, shmem-rss:0kB
> oom_reaper: reaped process 18415 (foo.sh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The cgroup v2 documentation mentions that the OOM killer will only be
invoked when the out-of-memory situation happens inside a page fault.
This problem does always happen during a page fault, so that part is
not surprising, but I'm not sure why the process ends up in a fatal
page fault in the first place.

This is an old problem, and from searching online I found that an
initial solution was implemented by Michal Hocko and Hugh Dickins in
e62e384 ("memcg: prevent OOM with too many dirty pages") and c3b94f4
("memcg: further prevent OOM with too many dirty pages") respectively,
and then further tweaked by Michal to avoid possible deadlocks in
ecf5fc6 ("mm, vmscan: Do not wait for page writeback for GFP_NOFS
allocations").

This initial ad-hoc implementation was later improved by Tejun Heo in
c2aa723 ("writeback: implement memcg writeback domain based
throttling") and 97c9341 ("mm: vmscan: disable memcg direct reclaim
stalling if cgroup writeback support is in use"). According to the
commit log, that change "makes the dirty throttling mechanism
operational for memcg domains including
writeback-bandwidth-proportional dirty page distribution inside them".

I verified that my kernel has cgroup writeback support enabled, but
it's unclear whether it is used for legacy hierarchies. Even if it's
not, according to Tejun's commit the old ad-hoc method should still be
activated to throttle the process. The reason I'm using the legacy
hierarchy is that I'm running the workload in a container and Docker
doesn't yet support cgroups v2
(https://github.com/moby/moby/issues/25868).

So my question is: with the current state of affairs, is there a way
to use a memory cgroup to constrain an IO-intensive process so that it
does not evict useful pages from the page cache, or is the only
solution to alter the application to use fsync() and
posix_fadvise(POSIX_FADV_DONTNEED)?
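
For reference, the kind of change that second option would mean in the
application is roughly the following sketch (the flush_and_drop() name
is just illustrative, and error handling is minimal):

  #include <fcntl.h>
  #include <unistd.h>

  /* After a file has been fully written: force its dirty pages out to
   * disk, then tell the kernel its page cache can be dropped. */
  int flush_and_drop(int fd)
  {
      if (fsync(fd) != 0)
          return -1;
      /* offset 0, len 0 == apply the advice to the whole file */
      return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
  }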

Best,

-- 
Petros Angelatos
CTO & Founder, Resin.io
BA81 DC1C D900 9B24 2F88  6FDD 4404 DDEE 92BF 1079


* Re: Memory cgroup invokes OOM killer when there are a lot of dirty pages
@ 2018-07-04  7:50 ` Michal Hocko
From: Michal Hocko @ 2018-07-04  7:50 UTC
  To: Petros Angelatos; +Cc: linux-mm, cgroups, Hugh Dickins, Tejun Heo, lstoakes

On Wed 04-07-18 00:14:39, Petros Angelatos wrote:
> Hello,
> 
> I'm facing a strange problem when I constrain an IO intensive
> application that generates a lot of dirty pages inside a v1 cgroup
> with a memory controller. After a while the OOM killer kicks in and
> kills the processes instead of throttling the allocations while dirty
> pages are being flushed. Here is a test program that reproduces the
> issue:
> 
>   cd /sys/fs/cgroup/memory/
>   mkdir dirty-test
>   echo 10485760 > dirty-test/memory.limit_in_bytes
> 
>   echo $$ > dirty-test/cgroup.procs
> 
>   rm /mnt/file_*
>   for i in $(seq 500); do
>     dd if=/dev/urandom count=2048 of="/mnt/file_$i"
>   done
> 
> When a process gets killed I get the following trace in dmesg:
> 
> > foo.sh invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
> > foo.sh cpuset=/ mems_allowed=0
> > CPU: 0 PID: 18415 Comm: foo.sh Tainted: P           O      4.17.2-1-ARCH #1
> > Hardware name: LENOVO 20F9CTO1WW/20F9CTO1WW, BIOS N1CET52W (1.20 ) 11/30/2016
[...]
> > Task in /dirty-test killed as a result of limit of /dirty-test
> > memory: usage 10240kB, limit 10240kB, failcnt 13073
> > memory+swap: usage 10240kB, limit 9007199254740988kB, failcnt 0
> > kmem: usage 1308kB, limit 9007199254740988kB, failcnt 0
> > Memory cgroup stats for /dirty-test: cache:8848KB rss:180KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:8580KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:200KB inactive_file:4364KB active_file:4364KB unevictable:0KB
> > [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > [18160]     0 18160     3468      652    73728        0             0 foo.sh
> > [18415]     0 18415     3468      118    61440        0             0 foo.sh
> > Memory cgroup out of memory: Kill process 18160 (foo.sh) score 261 or sacrifice child
> > Killed process 18415 (foo.sh) total-vm:13872kB, anon-rss:472kB, file-rss:0kB, shmem-rss:0kB
> > oom_reaper: reaped process 18415 (foo.sh), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> The cgroup v2 documentation mentions that the OOM killer will only be
> invoked when the out-of-memory situation happens inside a page fault.
> This problem does always happen during a page fault, so that part is
> not surprising, but I'm not sure why the process ends up in a fatal
> page fault in the first place.

I assume dd just tried to fault a code page in and that failed due to
the hard limit and unreclaimable memory. The reason the memcg v1 OOM
throttling heuristic hasn't kicked in is that there are no pages under
writeback (your report shows dirty:8580kB but writeback:0kB). This
would match the symptoms of the bug fixed by 1c610d5f93c7 ("mm/vmscan:
wake up flushers for legacy cgroups too") in 4.16, but there might be
more such issues. You should have that fix already, so there must be
something more in play. You've said that you are using the blkio
cgroup, right? What is its configuration? I strongly suspect that none
of the writeback has started because of the throttling.

-- 
Michal Hocko
SUSE Labs


* Re: Memory cgroup invokes OOM killer when there are a lot of dirty pages
@ 2018-07-04 15:45   ` Petros Angelatos
From: Petros Angelatos @ 2018-07-04 15:45 UTC
  To: Michal Hocko; +Cc: linux-mm, cgroups, Hugh Dickins, Tejun Heo, lstoakes

> I assume dd just tried to fault a code page in and that failed due to
> the hard limit and unreclaimable memory. The reason the memcg v1 OOM
> throttling heuristic hasn't kicked in is that there are no pages under
> writeback (your report shows dirty:8580kB but writeback:0kB). This
> would match the symptoms of the bug fixed by 1c610d5f93c7 ("mm/vmscan:
> wake up flushers for legacy cgroups too") in 4.16, but there might be
> more such issues. You should have that fix already, so there must be
> something more in play. You've said that you are using the blkio
> cgroup, right? What is its configuration? I strongly suspect that none
> of the writeback has started because of the throttling.

I'm only using a memory cgroup with no blkio restrictions, so I'm not
sure why writeback hasn't started. Another thing I noticed is that it
is a lot harder to reproduce when the same amount of data is written
to a single file rather than to many smaller files. That's why my
original example writes 500 files of 1MB each.

Your mention of writeback gave me the idea to try doing a
sync_file_range() with SYNC_FILE_RANGE_WRITE after writing each file,
to manually schedule writeback, and surprisingly that fixed the
problem. Is that an indication of a kernel bug, i.e. that writeback is
not being triggered in time?
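
Concretely, what I added after writing each file is roughly the sketch
below (the start_writeback() wrapper name is mine, just for
illustration):

  #define _GNU_SOURCE          /* sync_file_range() is Linux-specific */
  #include <fcntl.h>

  /* Start asynchronous writeback of all dirty pages of the file
   * (offset 0, nbytes 0 == through end of file) without waiting for
   * the IO to complete. */
  int start_writeback(int fd)
  {
      return sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
  }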

Also, you mentioned that the page fault is probably due to a code
page. Would another remedy be to lock the whole executable and its
dynamic libraries in memory with mlock() before starting the IO
operations?
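
If that is a sane approach at all, I imagine it would look roughly
like the sketch below, using mlockall() rather than mlock()ing each
mapping individually (lock_code_in_memory() is just an illustrative
name):

  #include <sys/mman.h>

  /* Pin everything currently mapped (executable text, shared
   * libraries, stack, ...) so those pages cannot be reclaimed during
   * the IO phase. Needs CAP_IPC_LOCK or a large enough
   * RLIMIT_MEMLOCK. */
  int lock_code_in_memory(void)
  {
      return mlockall(MCL_CURRENT);
  }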

-- 
Petros Angelatos
CTO & Founder, Resin.io
BA81 DC1C D900 9B24 2F88  6FDD 4404 DDEE 92BF 1079


* Re: Memory cgroup invokes OOM killer when there are a lot of dirty pages
@ 2018-07-05  6:46     ` Michal Hocko
From: Michal Hocko @ 2018-07-05  6:46 UTC
  To: Petros Angelatos; +Cc: linux-mm, cgroups, Hugh Dickins, Tejun Heo, lstoakes

On Wed 04-07-18 18:45:48, Petros Angelatos wrote:
> > I assume dd just tried to fault a code page in and that failed due to
> > the hard limit and unreclaimable memory. The reason the memcg v1 OOM
> > throttling heuristic hasn't kicked in is that there are no pages under
> > writeback (your report shows dirty:8580kB but writeback:0kB). This
> > would match the symptoms of the bug fixed by 1c610d5f93c7 ("mm/vmscan:
> > wake up flushers for legacy cgroups too") in 4.16, but there might be
> > more such issues. You should have that fix already, so there must be
> > something more in play. You've said that you are using the blkio
> > cgroup, right? What is its configuration? I strongly suspect that none
> > of the writeback has started because of the throttling.
> 
> I'm only using a memory cgroup with no blkio restrictions, so I'm not
> sure why writeback hasn't started. Another thing I noticed is that it
> is a lot harder to reproduce when the same amount of data is written
> to a single file rather than to many smaller files. That's why my
> original example writes 500 files of 1MB each.
> 
> Your mention of writeback gave me the idea to try doing a
> sync_file_range() with SYNC_FILE_RANGE_WRITE after writing each file,
> to manually schedule writeback, and surprisingly that fixed the
> problem. Is that an indication of a kernel bug, i.e. that writeback is
> not being triggered in time?

Yeah, it smells like it. If you look at 1c610d5f93c7, we had a bug
where we didn't even kick the flushers. So it seems they do not start
doing useful work in time. I would start digging in that direction.

> Also, you mentioned that the page fault is probably due to a code
> page. Would another remedy be to lock the whole executable and its
> dynamic libraries in memory with mlock() before starting the IO
> operations?

That looks like a big hammer to me.
-- 
Michal Hocko
SUSE Labs

