linux-kernel.vger.kernel.org archive mirror
From: Michal Hocko <mhocko@suse.com>
To: Feng Tang <feng.tang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Matthew Wilcox <willy@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	dave.hansen@intel.com, ying.huang@intel.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] mm, page_alloc: loose the node binding check to avoid helpless oom killing
Date: Wed, 4 Nov 2020 08:23:58 +0100
Message-ID: <20201104072358.GP21990@dhcp22.suse.cz>
In-Reply-To: <1604470210-124827-3-git-send-email-feng.tang@intel.com>

On Wed 04-11-20 14:10:10, Feng Tang wrote:
> With the arrival of the memory hotplug feature and persistent memory,
> some platforms have memory nodes which only have a movable zone.
> 
> Users may bind some of their workloads (like docker/containers) to
> these nodes, and there are many reports of OOM and page allocation
> failures. One call stack is:
> 
> 	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
> 	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
> 	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
> 	[ 1387.877570] Call Trace:
> 	[ 1387.877579]  dump_stack+0x6b/0x88
> 	[ 1387.877584]  dump_header+0x4a/0x1e2
> 	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
> 	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
> 	[ 1387.877591]  out_of_memory+0x3d/0x80
> 	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
> 	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
> 	[ 1387.877602]  pipe_write+0x322/0x590
> 	[ 1387.877607]  new_sync_write+0x196/0x1b0
> 	[ 1387.877609]  vfs_write+0x1c3/0x1f0
> 	[ 1387.877611]  ksys_write+0xa7/0xe0
> 	[ 1387.877617]  do_syscall_64+0x52/0xd0
> 	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> In a full container run, like installing and running the stress tool
> "stress-ng", there are many different kinds of page requests (gfp_masks),
> many of which only allow non-movable zones. Some of them can fall back
> to other nodes with NORMAL/DMA32/DMA zones, but others are blocked by
> the __GFP_HARDWALL or ALLOC_CPUSET check, and cause OOM killing. But
> OOM killing won't help here, as this is not a lack of free memory;
> the allocation is simply blocked by the node binding policy check.
> 
> So loosen the policy check for this case.

This allows memory allocations to spill over to any other node which has
Normal (or lower) zones, and as such it breaks cpuset isolation. As I've
pointed out in my reply to your cover letter, this seems to be more of a
misconfiguration than a bug.

I do understand that killing any other task which can allocate from this
node is quite goofy, and that is something we can detect and target
better, e.g. by failing the allocation or killing the allocating context
when the allocation request cannot be satisfied by any means. But breaking
the node isolation, which is a user contract, sounds like a bad workaround.
Binding to movable node(s) without any other fallback is simply
something you shouldn't do.

> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/page_alloc.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d772206..efd49a9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4669,6 +4669,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> +	/*
> +	 * If the task's target memory nodes only has movable zones, while the
> +	 * gfp_mask allowed zone is lower than ZONE_MOVABLE, loose the check
> +	 * for __GFP_HARDWALL and ALLOC_CPUSET, otherwise it could trigger OOM
> +	 * killing, which still can not solve this policy check.
> +	 */
> +	if (ac->highest_zoneidx <= ZONE_NORMAL) {
> +		int nid;
> +		unsigned long unmovable = 0;
> +
> +		/* FIXME: this could be a separate function */
> +		for_each_node_mask(nid, cpuset_current_mems_allowed) {
> +			unmovable += NODE_DATA(nid)->node_present_pages -
> +				NODE_DATA(nid)->node_zones[ZONE_MOVABLE].present_pages;
> +		}
> +
> +		if (!unmovable) {
> +			gfp_mask &= ~(__GFP_HARDWALL);
> +			alloc_flags &= ~ALLOC_CPUSET;
> +		}
> +	}
> +
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
>  
> -- 
> 2.7.4
> 
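
To make the two options concrete, here is a rough userspace model of the
check in the hunk above. It is not kernel code and not a proposed patch:
struct node_model, enum alloc_decision, decide() and the keep_isolation
flag are all hypothetical names made up for illustration, and the
request_needs_lower_zone flag merely stands in for the
ac->highest_zoneidx <= ZONE_NORMAL test. The only part taken from the
patch is the sum of non-movable present pages over the allowed nodes.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's per-node data. */
struct node_model {
	unsigned long present_pages;         /* all pages present on the node */
	unsigned long movable_present_pages; /* pages in the node's ZONE_MOVABLE */
};

/* Hypothetical outcomes of the policy decision modelled here. */
enum alloc_decision {
	ALLOC_TRY_ALLOWED_NODES, /* the allowed nodes can satisfy the request */
	ALLOC_RELAX_BINDING,     /* RFC patch: drop the cpuset/hardwall check */
	ALLOC_FAIL_REQUEST,      /* alternative: fail, keep the isolation */
};

/* Sum the non-movable pages of every node whose bit is set in allowed_mask. */
static unsigned long unmovable_pages(const struct node_model *nodes,
				     int nr_nodes, unsigned long allowed_mask)
{
	unsigned long unmovable = 0;

	for (int nid = 0; nid < nr_nodes; nid++) {
		if (!(allowed_mask & (1UL << nid)))
			continue;
		unmovable += nodes[nid].present_pages -
			     nodes[nid].movable_present_pages;
	}
	return unmovable;
}

/*
 * Decide what to do with a request. Only requests that cannot use
 * ZONE_MOVABLE are affected, and only when the allowed nodes hold
 * nothing but movable memory.
 */
static enum alloc_decision decide(const struct node_model *nodes, int nr_nodes,
				  unsigned long allowed_mask,
				  bool request_needs_lower_zone,
				  bool keep_isolation)
{
	if (!request_needs_lower_zone)
		return ALLOC_TRY_ALLOWED_NODES;
	if (unmovable_pages(nodes, nr_nodes, allowed_mask) != 0)
		return ALLOC_TRY_ALLOWED_NODES;
	return keep_isolation ? ALLOC_FAIL_REQUEST : ALLOC_RELAX_BINDING;
}
```

With a task bound only to a movable-only node, decide() returns
ALLOC_RELAX_BINDING when keep_isolation is false (the behaviour of the
RFC) and ALLOC_FAIL_REQUEST when it is true (the direction argued for in
this reply). In neither case does the model pick an unrelated task to
kill.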

-- 
Michal Hocko
SUSE Labs

Thread overview: 27+ messages
2020-11-04  6:10 [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Feng Tang
2020-11-04  6:10 ` [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes Feng Tang
2020-11-04  7:18   ` Michal Hocko
2020-11-04  6:10 ` [RFC PATCH 2/2] mm, page_alloc: loose the node binding check to avoid helpless oom killing Feng Tang
2020-11-04  7:23   ` Michal Hocko [this message]
2020-11-04  7:13 ` [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Michal Hocko
2020-11-04  7:38   ` Feng Tang
2020-11-04  7:58     ` Michal Hocko
2020-11-04  8:40       ` Feng Tang
2020-11-04  8:53         ` Michal Hocko
2020-11-05  1:40           ` Feng Tang
2020-11-05 12:08             ` Michal Hocko
2020-11-05 12:53               ` Vlastimil Babka
2020-11-05 12:58                 ` Michal Hocko
2020-11-05 13:07                   ` Feng Tang
2020-11-05 13:12                     ` Michal Hocko
2020-11-05 13:43                       ` Feng Tang
2020-11-05 16:16                         ` Michal Hocko
2020-11-06  7:06                           ` Feng Tang
2020-11-06  8:10                             ` Michal Hocko
2020-11-06  9:08                               ` Feng Tang
2020-11-06 10:35                                 ` Michal Hocko
2020-11-05 13:14                   ` Vlastimil Babka
2020-11-05 13:19                     ` Michal Hocko
2020-11-05 13:34                       ` Vlastimil Babka
2020-11-06  4:32               ` Huang, Ying
2020-11-06  7:43                 ` Michal Hocko
