From: Johannes Weiner <hannes@cmpxchg.org>
To: Greg Thelen <gthelen@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Roman Gushchin <guro@fb.com>, Michal Hocko <mhocko@kernel.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Tejun Heo <tj@kernel.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
stable@vger.kernel.org
Subject: Re: [PATCH v2] writeback: use exact memcg dirty counts
Date: Mon, 1 Apr 2019 13:32:02 -0400 [thread overview]
Message-ID: <20190401173202.GA2953@cmpxchg.org> (raw)
In-Reply-To: <20190329174609.164344-1-gthelen@google.com>
On Fri, Mar 29, 2019 at 10:46:09AM -0700, Greg Thelen wrote:
> Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
> memory.stat reporting") memcg dirty and writeback counters are managed
> as:
> 1) per-memcg per-cpu values in range of [-32..32]
> 2) per-memcg atomic counter
> When a per-cpu counter cannot fit in [-32..32] it's flushed to the
> atomic. Stat readers only check the atomic.
> Thus readers such as balance_dirty_pages() may see a nontrivial error
> margin: 32 pages per cpu.
> Assuming 100 cpus:
> 4k x86 page_size: 13 MiB error per memcg
> 64k ppc page_size: 200 MiB error per memcg
> Considering that dirty+writeback are used together for some decisions
> the errors double.
>
> This inaccuracy can lead to undeserved oom kills. One nasty case is
> when all per-cpu counters hold positive values offsetting an atomic
> negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
> balance_dirty_pages() only consults the atomic and does not consider
> throttling the next n_cpu*32 dirty pages. If the file_lru is in the
> 13..200 MiB range then there's absolutely no dirty throttling, which
> burdens vmscan with only dirty+writeback pages thus resorting to oom
> kill.
>
> It could be argued that tiny containers are not supported, but it's more
> subtle. It's the amount the space available for file lru that matters.
> If a container has memory.max-200MiB of non reclaimable memory, then it
> will also suffer such oom kills on a 100 cpu machine.
>
> The following test reliably ooms without this patch. This patch avoids
> oom kills.
>
> $ cat test
> mount -t cgroup2 none /dev/cgroup
> cd /dev/cgroup
> echo +io +memory > cgroup.subtree_control
> mkdir test
> cd test
> echo 10M > memory.max
> (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
> (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
>
> $ cat memcg-writeback-stress.c
> /*
> * Dirty pages from all but one cpu.
> * Clean pages from the non dirtying cpu.
> * This is to stress per cpu counter imbalance.
> * On a 100 cpu machine:
> * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
> * - per memcg atomic is -99*32 pages
> * - thus the complete dirty limit: sum of all counters 0
> * - balance_dirty_pages() only sees atomic count -99*32 pages, which
> * it max()s to 0.
> * - So a workload can dirty -99*32 pages before balance_dirty_pages()
> * cares.
> */
> #define _GNU_SOURCE
> #include <err.h>
> #include <fcntl.h>
> #include <sched.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <sys/stat.h>
> #include <sys/sysinfo.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> static char *buf;
> static int bufSize;
>
> static void set_affinity(int cpu)
> {
> cpu_set_t affinity;
>
> CPU_ZERO(&affinity);
> CPU_SET(cpu, &affinity);
> if (sched_setaffinity(0, sizeof(affinity), &affinity))
> err(1, "sched_setaffinity");
> }
>
> static void dirty_on(int output_fd, int cpu)
> {
> int i, wrote;
>
> set_affinity(cpu);
> for (i = 0; i < 32; i++) {
> for (wrote = 0; wrote < bufSize; ) {
> int ret = write(output_fd, buf+wrote, bufSize-wrote);
> if (ret == -1)
> err(1, "write");
> wrote += ret;
> }
> }
> }
>
> int main(int argc, char **argv)
> {
> int cpu, flush_cpu = 1, output_fd;
> const char *output;
>
> if (argc != 2)
> errx(1, "usage: output_file");
>
> output = argv[1];
> bufSize = getpagesize();
> buf = malloc(getpagesize());
> if (buf == NULL)
> errx(1, "malloc failed");
>
> output_fd = open(output, O_CREAT|O_RDWR);
> if (output_fd == -1)
> err(1, "open(%s)", output);
>
> for (cpu = 0; cpu < get_nprocs(); cpu++) {
> if (cpu != flush_cpu)
> dirty_on(output_fd, cpu);
> }
>
> set_affinity(flush_cpu);
> if (fsync(output_fd))
> err(1, "fsync(%s)", output);
> if (close(output_fd))
> err(1, "close(%s)", output);
> free(buf);
> }
>
> Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
> collect exact per memcg counters. This avoids the aforementioned oom
> kills.
>
> This does not affect the overhead of memory.stat, which still reads the
> single atomic counter.
>
> Why not use percpu_counter? memcg already handles cpus going offline,
> so no need for that overhead from percpu_counter. And the
> percpu_counter spinlocks are more heavyweight than is required.
>
> It probably also makes sense to use exact dirty and writeback counters
> in memcg oom reports. But that is saved for later.
>
> Cc: stable@vger.kernel.org # v4.16+
> Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Thanks Greg!
next prev parent reply other threads:[~2019-04-01 17:47 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-03-29 17:46 [PATCH v2] writeback: use exact memcg dirty counts Greg Thelen
2019-03-29 20:38 ` Roman Gushchin
2019-04-01 17:32 ` Johannes Weiner [this message]
2019-04-01 18:20 ` Johannes Weiner
2019-04-02 21:32 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190401173202.GA2953@cmpxchg.org \
--to=hannes@cmpxchg.org \
--cc=akpm@linux-foundation.org \
--cc=gthelen@google.com \
--cc=guro@fb.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@kernel.org \
--cc=stable@vger.kernel.org \
--cc=tj@kernel.org \
--cc=vdavydov.dev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).