LKML Archive on lore.kernel.org
 help / color / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Greg Thelen <gthelen@google.com>,
	Roman Gushchin <guro@fb.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Tejun Heo <tj@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH 4.14 36/52] mm: writeback: use exact memcg dirty counts
Date: Mon,  5 Apr 2021 10:54:02 +0200
Message-ID: <20210405085023.164430007@linuxfoundation.org> (raw)
In-Reply-To: <20210405085021.996963957@linuxfoundation.org>

From: Greg Thelen <gthelen@google.com>

commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.

Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:

 1) per-memcg per-cpu values in range of [-32..32]

 2) per-memcg atomic counter

When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic.  Stat readers only check the atomic.  Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu.

Assuming 100 cpus:
   4k x86 page_size:  13 MiB error per memcg
  64k ppc page_size: 200 MiB error per memcg

Considering that dirty+writeback are used together for some decisions the
errors double.

This inaccuracy can lead to undeserved oom kills.  One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's more
subtle.  It's the amount the space available for file lru that matters.
If a container has memory.max-200MiB of non reclaimable memory, then it
will also suffer such oom kills on a 100 cpu machine.

The following test reliably ooms without this patch.  This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
  	cpu_set_t affinity;

  	CPU_ZERO(&affinity);
  	CPU_SET(cpu, &affinity);
  	if (sched_setaffinity(0, sizeof(affinity), &affinity))
  		err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
  	int i, wrote;

  	set_affinity(cpu);
  	for (i = 0; i < 32; i++) {
  		for (wrote = 0; wrote < bufSize; ) {
  			int ret = write(output_fd, buf+wrote, bufSize-wrote);
  			if (ret == -1)
  				err(1, "write");
  			wrote += ret;
  		}
  	}
  }

  int main(int argc, char **argv)
  {
  	int cpu, flush_cpu = 1, output_fd;
  	const char *output;

  	if (argc != 2)
  		errx(1, "usage: output_file");

  	output = argv[1];
  	bufSize = getpagesize();
  	buf = malloc(getpagesize());
  	if (buf == NULL)
  		errx(1, "malloc failed");

  	output_fd = open(output, O_CREAT|O_RDWR);
  	if (output_fd == -1)
  		err(1, "open(%s)", output);

  	for (cpu = 0; cpu < get_nprocs(); cpu++) {
  		if (cpu != flush_cpu)
  			dirty_on(output_fd, cpu);
  	}

  	set_affinity(flush_cpu);
  	if (fsync(output_fd))
  		err(1, "fsync(%s)", output);
  	if (close(output_fd))
  		err(1, "close(%s)", output);
  	free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters.  This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter? memcg already handles cpus going offline, so
no need for that overhead from percpu_counter.  And the percpu_counter
spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports.  But that is saved for later.

Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>	[4.16+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 include/linux/memcontrol.h |  5 ++++-
 mm/memcontrol.c            | 20 ++++++++++++++++++--
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b5cd86e320ff..365d5079d1cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -507,7 +507,10 @@ struct mem_cgroup *lock_page_memcg(struct page *page);
 void __unlock_page_memcg(struct mem_cgroup *memcg);
 void unlock_page_memcg(struct page *page);
 
-/* idx can be of type enum memcg_stat_item or node_stat_item */
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page_state().
+ */
 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
 					     int idx)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ef6d996a920a..5e8b8e1b7d90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3701,6 +3701,22 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
 	return &memcg->cgwb_domain;
 }
 
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page().
+ */
+static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
+{
+	long x = atomic_long_read(&memcg->stat[idx]);
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
+	if (x < 0)
+		x = 0;
+	return x;
+}
+
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
@@ -3726,10 +3742,10 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
+	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
 
 	/* this should eventually include NR_UNSTABLE_NFS */
-	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
+	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
 	*pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
 						     (1 << LRU_ACTIVE_FILE));
 	*pheadroom = PAGE_COUNTER_MAX;
-- 
2.30.2




  parent reply index

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-05  8:53 [PATCH 4.14 00/52] 4.14.229-rc1 review Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 01/52] selinux: vsock: Set SID for socket returned by accept() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 02/52] ipv6: weaken the v4mapped source check Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 03/52] ext4: fix bh ref count on error paths Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 04/52] rpc: fix NULL dereference on kmalloc failure Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 05/52] ASoC: rt5640: Fix dac- and adc- vol-tlv values being off by a factor of 10 Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 06/52] ASoC: rt5651: " Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 07/52] ASoC: sgtl5000: set DAP_AVC_CTRL register to correct default value on probe Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 08/52] ASoC: es8316: Simplify adc_pga_gain_tlv table Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 09/52] ASoC: cs42l42: Fix mixer volume control Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 10/52] ASoC: cs42l42: Always wait at least 3ms after reset Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 11/52] powerpc: Force inlining of cpu_has_feature() to avoid build failure Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 12/52] vhost: Fix vhost_vq_reset() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 13/52] scsi: st: Fix a use after free in st_open() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 14/52] scsi: qla2xxx: Fix broken #endif placement Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 15/52] staging: comedi: cb_pcidas: fix request_irq() warn Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 16/52] staging: comedi: cb_pcidas64: " Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 17/52] ASoC: rt5659: Update MCLK rate in set_sysclk() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 18/52] ext4: do not iput inode under running transaction in ext4_rename() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 19/52] brcmfmac: clear EAP/association status bits on linkdown events Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 20/52] net: ethernet: aquantia: Handle error cleanup of start on open Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 21/52] appletalk: Fix skb allocation size in loopback case Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 22/52] net: wan/lmc: unregister device when no matching device is found Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 23/52] bpf: Remove MTU check in __bpf_skb_max_len Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 24/52] ALSA: usb-audio: Apply sample rate quirk to Logitech Connect Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 25/52] ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 26/52] ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 27/52] tracing: Fix stack trace event size Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 28/52] mm: fix race by making init_zero_pfn() early_initcall Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 29/52] drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings() Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 30/52] drm/amdgpu: check alignment on CPU page for bo map Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 31/52] reiserfs: update reiserfs_xattrs_initialized() condition Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 32/52] mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats Greg Kroah-Hartman
2021-04-05  8:53 ` [PATCH 4.14 33/52] mm: memcg: make sure memory.events is uptodate when waking pollers Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 34/52] mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 35/52] mm: fix oom_kill event handling Greg Kroah-Hartman
2021-04-05  8:54 ` Greg Kroah-Hartman [this message]
2021-04-05  8:54 ` [PATCH 4.14 37/52] pinctrl: rockchip: fix restore error in resume Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 38/52] extcon: Add stubs for extcon_register_notifier_all() functions Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 39/52] extcon: Fix error handling in extcon_dev_register Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 40/52] firewire: nosy: Fix a use-after-free bug in nosy_ioctl() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 41/52] usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 42/52] USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 43/52] usb: musb: Fix suspend with devices connected for a64 Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 44/52] usb: xhci-mtk: fix broken streams issue on 0.96 xHCI Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 45/52] cdc-acm: fix BREAK rx code path adding necessary calls Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 46/52] USB: cdc-acm: untangle a circular dependency between callback and softint Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 47/52] USB: cdc-acm: downgrade message to debug Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 48/52] USB: cdc-acm: fix use-after-free after probe failure Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 49/52] usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 50/52] staging: rtl8192e: Fix incorrect source in memcpy() Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 51/52] staging: rtl8192e: Change state information from u16 to u8 Greg Kroah-Hartman
2021-04-05  8:54 ` [PATCH 4.14 52/52] drivers: video: fbcon: fix NULL dereference in fbcon_cursor() Greg Kroah-Hartman
2021-04-05 17:57 ` [PATCH 4.14 00/52] 4.14.229-rc1 review Guenter Roeck
2021-04-06  7:14 ` Naresh Kamboju
2021-04-07  0:58 ` Samuel Zou

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210405085023.164430007@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=akpm@linux-foundation.org \
    --cc=gthelen@google.com \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhocko@kernel.org \
    --cc=sashal@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git
	git clone --mirror https://lore.kernel.org/lkml/7 lkml/git/7.git
	git clone --mirror https://lore.kernel.org/lkml/8 lkml/git/8.git
	git clone --mirror https://lore.kernel.org/lkml/9 lkml/git/9.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org
	public-inbox-index lkml

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git