linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm/page_alloc: detect allocation forbidden by cpuset and bail out early
@ 2021-09-07  8:25 Feng Tang
  2021-09-07  8:44 ` Michal Hocko
  0 siblings, 1 reply; 11+ messages in thread
From: Feng Tang @ 2021-09-07  8:25 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, David Rientjes, Mel Gorman,
	Vlastimil Babka, linux-mm
  Cc: linux-kernel, Feng Tang

There was report that starting an Ubuntu in docker while using cpuset
to bind it to movlabe nodes (a node only has movable zone, like a node
for hotplug or a Persistent Memory  node in normal usage) will fail
due to memory allocation failure, and then OOM is involved and many
other innocent processes got killed. It can be reproduced with command:
$docker run -it --rm  --cpuset-mems 4 ubuntu:latest bash -c
"grep Mems_allowed /proc/self/status" (node 4 is a movable node)

The reason is, in the case, the target cpuset nodes only have movable
zone, while the creation of an OS in docker sometimes needs to allocate
memory in non-movable zones (dma/dma32/normal) like GFP_HIGHUSER, and
the cpuset limit forbids the allocation, then out-of-memory killing is
involved even when normal nodes and movable nodes both have many free
memory.

The failure is reasonable, but still there is one problem, that when
the usage fails as it's an mission impossible due to the cpuset limit,
the allocation should just not trigger reclaim/compaction, and more
importantly, not get any innocent process oom-killed.

So add detection for cases like this in the slowpath of allocation,
and bail out early returning NULL for the allocation.

We've run some cases of malloc/mmap/page_fault/lru-shm/swap from
will-it-scale and vm-scalability, and didn't see obvious performance
change (all inside +/- 1%), test boxes are 2 socket Cascade Lake and
Icelake servers.

[thanks to Micho Hocko and David Rientjes for suggesting not handle
 it inside OOM code]

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---

Changelog:

  since RFC
  * move the handling from oom code to page allocation path (Michal/David)

 mm/page_alloc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f95e1d2386a1..d6657f68d1fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4929,6 +4929,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
 
-- 
2.14.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-09-10 11:44 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-07  8:25 [PATCH] mm/page_alloc: detect allocation forbidden by cpuset and bail out early Feng Tang
2021-09-07  8:44 ` Michal Hocko
2021-09-08  1:50   ` Feng Tang
2021-09-08  7:06     ` Michal Hocko
2021-09-08  8:12       ` Feng Tang
2021-09-10  7:44       ` Feng Tang
2021-09-10  8:35         ` Michal Hocko
2021-09-10  9:21           ` Feng Tang
2021-09-10 10:35             ` Michal Hocko
2021-09-10 11:29               ` Feng Tang
2021-09-10 11:43                 ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).