From: Sasha Levin <sashal@kernel.org>
To: stable@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Roman Gushchin <guro@fb.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Sasha Levin <sashal@kernel.org>,
	linux-doc@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org
Subject: [PATCH AUTOSEL 4.19 40/44] mm: don't raise MEMCG_OOM event due to failed high-order allocation
Date: Tue, 13 Nov 2018 00:49:46 -0500
Message-ID: <20181113054950.77898-40-sashal@kernel.org>
In-Reply-To: <20181113054950.77898-1-sashal@kernel.org>

From: Roman Gushchin <guro@fb.com>

[ Upstream commit 7a1adfddaf0d11a39fdcaf6e82a88e9c0586e08b ]

It was reported that on some of our machines containers were restarted
with OOM symptoms without an obvious reason.  Although there was almost
no memory pressure and plenty of page cache, a MEMCG_OOM event was raised
occasionally, causing the container management software to think that an
OOM had happened.  However, no tasks were killed.

Further investigation showed that the problem is caused by a failed
attempt to charge a high-order page.  In such a case, the OOM killer is
never invoked.  As shown below, this can happen under conditions that are
very far from a real OOM: e.g.  there is plenty of clean page cache and no
memory pressure.

It makes no sense to raise an OOM event in this case, as it might
confuse a user and lead to wrong and excessive actions (e.g.  restarting
the workload, as in my case).

Let's look at the charging path in try_charge().  If the memory usage is
close to memory.max, which is perfectly normal for most memory cgroups, we
try to reclaim some pages.  Even if we were able to reclaim enough memory
for the allocation, the following check can fail due to a race with
another concurrent allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering
the OOM:

    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
        goto retry;

For a high-order allocation, however, this condition intentionally fails.
The reason is that we'll likely fall back to regular pages anyway, so it's
OK and even preferable to return ENOMEM.

In this case the idea of raising MEMCG_OOM looks dubious.

Fix this by moving MEMCG_OOM raising to mem_cgroup_oom(), after the
allocation order check, so that the event won't be raised for high-order
allocations.  This change doesn't affect regular page allocation and
charging.
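
The effect of moving the event can be illustrated with a small user-space
model.  This is only a sketch, not kernel code: oom_events,
charge_failed_old() and charge_failed_new() are hypothetical names standing
in for the MEMCG_OOM counter and for the tail of try_charge() together with
mem_cgroup_oom(), and the earlier exits in try_charge() (retries, fatal
signals, gfp flags) are left out.  Only PAGE_ALLOC_COSTLY_ORDER and its
value of 3 come from the kernel.

```c
#include <stdbool.h>

/* Same value as the kernel's PAGE_ALLOC_COSTLY_ORDER. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical stand-in for the cgroup's MEMCG_OOM event counter. */
unsigned long oom_events;

/*
 * Model of the old behaviour: the event is raised before the order
 * check, so it is counted even when the OOM killer is skipped.
 * Returns true if the OOM killer would be considered.
 */
bool charge_failed_old(int order)
{
	oom_events++;
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return false;	/* corresponds to OOM_SKIPPED */
	return true;
}

/*
 * Model of the new behaviour: the order check comes first, so a
 * failed high-order charge leaves the counter untouched.
 */
bool charge_failed_new(int order)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return false;	/* corresponds to OOM_SKIPPED */
	oom_events++;
	return true;
}
```

In this model, an order-9 charge (e.g. a 2MB huge page with 4KB base pages)
bumps the counter only in the old variant, while an order-0 charge bumps it
in both.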

Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 4 ++++
 mm/memcontrol.c                         | 4 ++--
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 184193bcb262..5d9939388a78 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1127,6 +1127,10 @@ PAGE_SIZE multiple when read back.
 		disk readahead.  For now OOM in memory cgroup kills
 		tasks iff shortage has happened inside page fault.
 
+		This event is not raised if the OOM killer is not
+		considered as an option, e.g. for failed high-order
+		allocations.
+
 	  oom_kill
 		The number of processes belonging to this cgroup
 		killed by any kind of OOM killer.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e79cb59552d9..07c7af6f5e59 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1669,6 +1669,8 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		return OOM_SKIPPED;
 
+	memcg_memory_event(memcg, MEMCG_OOM);
+
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
@@ -2250,8 +2252,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (fatal_signal_pending(current))
 		goto force;
 
-	memcg_memory_event(mem_over_limit, MEMCG_OOM);
-
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
 	 * a forward progress or bypass the charge if the oom killer
-- 
2.17.1
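
For readers who monitor these counters from user space: the oom and
oom_kill entries documented above appear in memory.events as "key value"
lines.  Below is a minimal sketch of a parser for that text format.
memcg_event_count() is a hypothetical helper, not an existing API, and the
caller is assumed to have already read the file contents into a string.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the counter named `key` from memory.events-style text
 * ("key value" per line), or -1 if the key is absent or malformed.
 */
long memcg_event_count(const char *text, const char *key)
{
	size_t keylen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/*
		 * Require a space right after the key so that "oom"
		 * does not match the "oom_kill" line.
		 */
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ' ') {
			long value;

			if (sscanf(line + keylen + 1, "%ld", &value) == 1)
				return value;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A monitor could compare the two counters: oom growing while oom_kill stays
flat means the cgroup hit its limit but no task was killed, which is the
situation that misled the container manager described in the commit message.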

