linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit
@ 2021-08-31  8:38 Feng Tang
  2021-08-31 15:57 ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Feng Tang @ 2021-08-31  8:38 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, Christian Brauner
  Cc: linux-kernel, Feng Tang

There was a report that starting an Ubuntu image in docker, while using
cpuset to bind it to movable nodes (a node that only has a movable zone,
like a node reserved for hotplug or a PMEM node in normal usage), fails
due to memory allocation failure, after which the OOM killer is invoked
and many other innocent processes get killed. It can be reproduced with
the command:
$ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c
"grep Mems_allowed /proc/self/status" (node 4 is a movable node)

The reason is that in this case the target cpuset nodes only have a
movable zone, while creating an OS instance in docker sometimes needs
to allocate memory from non-movable zones (DMA/DMA32/Normal), e.g. for
GFP_HIGHUSER allocations. The cpuset limit forbids those allocations,
so the OOM killer is invoked even when both the normal nodes and the
movable nodes have plenty of free memory.

We've posted patches to LKML trying to make this usage work by
loosening the check, but that was rejected, as the cpuset binding
should be respected and not be bypassed [1].

But there is still another problem: when the allocation fails because
it is impossible to satisfy under the cpuset limit, the allocating task
should be killed first, before any other innocent processes get killed.

Add detection for cases like this, and kill only the allocating task.

[1] https://lore.kernel.org/lkml/1604470210-124827-1-git-send-email-feng.tang@intel.com/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/oom.h |  1 +
 mm/oom_kill.c       | 13 ++++++++++++-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 2db9a1432511..bf470d8cc421 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -18,6 +18,7 @@ struct task_struct;
 enum oom_constraint {
 	CONSTRAINT_NONE,
 	CONSTRAINT_CPUSET,
+	CONSTRAINT_CPUSET_NONE,	/* no available zone from cpuset's mem nodes */
 	CONSTRAINT_MEMORY_POLICY,
 	CONSTRAINT_MEMCG,
 };
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 431d38c3bba8..021ec8954279 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -247,6 +247,7 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 static const char * const oom_constraint_text[] = {
 	[CONSTRAINT_NONE] = "CONSTRAINT_NONE",
 	[CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET",
+	[CONSTRAINT_CPUSET_NONE] = "CONSTRAINT_CPUSET_NONE",
 	[CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY",
 	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
 };
@@ -275,6 +276,14 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 
 	if (!oc->zonelist)
 		return CONSTRAINT_NONE;
+
+	if (cpusets_enabled() && (oc->gfp_mask & __GFP_HARDWALL)) {
+		z = first_zones_zonelist(oc->zonelist,
+			highest_zoneidx, &cpuset_current_mems_allowed);
+		if (!z->zone)
+			return CONSTRAINT_CPUSET_NONE;
+	}
+
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
 	 * to kill current.We have to random task kill in this case.
@@ -1093,7 +1102,9 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc);
 
-	if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
+	if (!is_memcg_oom(oc) &&
+	    (sysctl_oom_kill_allocating_task ||
+	       oc->constraint == CONSTRAINT_CPUSET_NONE) &&
 	    current->mm && !oom_unkillable_task(current) &&
 	    oom_cpuset_eligible(current, oc) &&
 	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-- 
2.14.1



* Re: [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit
  2021-08-31  8:38 [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit Feng Tang
@ 2021-08-31 15:57 ` Michal Hocko
  2021-09-01  1:06   ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2021-08-31 15:57 UTC (permalink / raw)
  To: Feng Tang; +Cc: linux-mm, Andrew Morton, Christian Brauner, linux-kernel

On Tue 31-08-21 16:38:05, Feng Tang wrote:
> There was a report that starting an Ubuntu image in docker, while using
> cpuset to bind it to movable nodes (a node that only has a movable zone,
> like a node reserved for hotplug or a PMEM node in normal usage), fails
> due to memory allocation failure, after which the OOM killer is invoked
> and many other innocent processes get killed. It can be reproduced with
> the command:
> $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c
> "grep Mems_allowed /proc/self/status" (node 4 is a movable node)

Is there any valid usecase to allow cpusets to be configured only with
movable nodes? Wouldn't it be better to simply disallow such a setup?
I do understand that we usually allow people to shoot themselves in the
foot, but this one has some wider consequences.

> The reason is that in this case the target cpuset nodes only have a
> movable zone, while creating an OS instance in docker sometimes needs
> to allocate memory from non-movable zones (DMA/DMA32/Normal), e.g. for
> GFP_HIGHUSER allocations. The cpuset limit forbids those allocations,
> so the OOM killer is invoked even when both the normal nodes and the
> movable nodes have plenty of free memory.
> 
> We've posted patches to LKML trying to make this usage work by
> loosening the check, but that was rejected, as the cpuset binding
> should be respected and not be bypassed [1].
> 
> But there is still another problem: when the allocation fails because
> it is impossible to satisfy under the cpuset limit, the allocating task
> should be killed first, before any other innocent processes get killed.

I do not like this solution TBH. We know that it is impossible to
satisfy the allocation at the page allocator level, so dealing with it
at the OOM killer level is just bad layering and a lot of wasted cycles
to reach that point. Why can't we simply fail the allocation if cpuset
filtering leads to an empty zone intersection?
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit
  2021-08-31 15:57 ` Michal Hocko
@ 2021-09-01  1:06   ` David Rientjes
  2021-09-01  2:44     ` Feng Tang
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2021-09-01  1:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Feng Tang, linux-mm, Andrew Morton, Christian Brauner, linux-kernel

On Tue, 31 Aug 2021, Michal Hocko wrote:

> I do not like this solution TBH. We know that it is impossible to
> satisfy the allocation at the page allocator level, so dealing with it
> at the OOM killer level is just bad layering and a lot of wasted cycles
> to reach that point. Why can't we simply fail the allocation if cpuset
> filtering leads to an empty zone intersection?

Cpusets will guarantee our effective nodemask will include at least one 
node in N_MEMORY (cpuset_mems_allowed()) so we'll always have at least one 
zone in our zonelist.

The issue in this case appears to be that the zone will never satisfy
non-movable allocations.  I think this is very similar to a GFP_DMA
allocation when bound to a node without lowmem, in which case we get a
page allocation failure.  We don't kill current there, as this patch does.

So I'd agree in this case that it would be better to simply fail the 
allocation.

Feng, would you move this check to __alloc_pages_may_oom() like the other 
special cases and simply fail rather than call into the oom killer?


* Re: [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit
  2021-09-01  1:06   ` David Rientjes
@ 2021-09-01  2:44     ` Feng Tang
  2021-09-01 13:42       ` Feng Tang
  0 siblings, 1 reply; 7+ messages in thread
From: Feng Tang @ 2021-09-01  2:44 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: Michal Hocko, linux-mm, Andrew Morton, Christian Brauner, linux-kernel

Hi David and Michal,

On Tue, Aug 31, 2021 at 06:06:17PM -0700, David Rientjes wrote:
> On Tue, 31 Aug 2021, Michal Hocko wrote:
> 
> > I do not like this solution TBH. We know that it is impossible to
> > satisfy the allocation at the page allocator level, so dealing with it
> > at the OOM killer level is just bad layering and a lot of wasted cycles
> > to reach that point. Why can't we simply fail the allocation if cpuset
> > filtering leads to an empty zone intersection?
> 
> Cpusets will guarantee our effective nodemask will include at least one 
> node in N_MEMORY (cpuset_mems_allowed()) so we'll always have at least one 
> zone in our zonelist.
> 
> The issue in this case appears to be that the zone will never satisfy
> non-movable allocations.  I think this is very similar to a GFP_DMA
> allocation when bound to a node without lowmem, in which case we get a
> page allocation failure.  We don't kill current there, as this patch does.
 
Thanks for sharing that case; the DMA case is quite similar. And in our
usage, the allocating task is eventually killed, but only after many OS
routine/GUI tasks get killed.

> So I'd agree in this case that it would be better to simply fail the 
> allocation.

I agree with your and Michal's comments: putting the check in the OOM
code is a little late and wastes CPU cycles.

> Feng, would you move this check to __alloc_pages_may_oom() like the other 
> special cases and simply fail rather than call into the oom killer?

Will explore more in this direction, thanks!

- Feng


* Re: [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit
  2021-09-01  2:44     ` Feng Tang
@ 2021-09-01 13:42       ` Feng Tang
  2021-09-01 14:05         ` Michal Hocko
  0 siblings, 1 reply; 7+ messages in thread
From: Feng Tang @ 2021-09-01 13:42 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko
  Cc: linux-mm, Andrew Morton, Christian Brauner, linux-kernel

On Wed, Sep 01, 2021 at 10:44:02AM +0800, Tang, Feng wrote:
[SNIP]
> > So I'd agree in this case that it would be better to simply fail the 
> > allocation.
> 
> I agree with your and Michal's comments: putting the check in the OOM
> code is a little late and wastes CPU cycles.
> 
> > Feng, would you move this check to __alloc_pages_may_oom() like the other 
> > special cases and simply fail rather than call into the oom killer?
> 
> Will explore more in this direction, thanks!
 
I tried the patch below, which solves the blind-killing issue: the
docker processes see page allocation errors and eventually quit
running.

Thanks,
Feng

---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9cb36bb..d1ae77be45a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4271,10 +4271,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	struct page *page;
+	struct page *page = NULL;
+	struct zoneref *z;
 
 	*did_some_progress = 0;
 
+	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
+		z = first_zones_zonelist(ac->zonelist,
+			gfp_zone(gfp_mask), &cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto out;
+	}
+
 	/*
 	 * Acquire the oom lock.  If that fails, somebody else is
 	 * making progress for us.


* Re: [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit
  2021-09-01 13:42       ` Feng Tang
@ 2021-09-01 14:05         ` Michal Hocko
  2021-09-02  7:34           ` Feng Tang
  0 siblings, 1 reply; 7+ messages in thread
From: Michal Hocko @ 2021-09-01 14:05 UTC (permalink / raw)
  To: Feng Tang
  Cc: David Rientjes, linux-mm, Andrew Morton, Christian Brauner, linux-kernel

On Wed 01-09-21 21:42:00, Feng Tang wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eeb3a9cb36bb..d1ae77be45a2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4271,10 +4271,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	struct page *page;
> +	struct page *page = NULL;
> +	struct zoneref *z;
>  
>  	*did_some_progress = 0;
>  
> +	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
> +		z = first_zones_zonelist(ac->zonelist,
> +			gfp_zone(gfp_mask), &cpuset_current_mems_allowed);
> +		if (!z->zone)
> +			goto out;
> +	}
> +

This looks better than the previous attempt. It would still be better
to solve this at the page allocator layer; the slowpath is already
doing this for the nodemask. E.g.:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index eeb3a9cb36bb..a3193134540d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4929,6 +4929,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain any suitable
+	 * zone to satisfy the request - e.g. kernel allocations from MOVABLE nodes only
+	 */
+	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx,
+				&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
 
If this is seen as additional overhead just to cope with an insane
configuration, then we can add an insane_cpusets_enabled() helper,
which would be a static branch enabled when somebody actually tries to
configure movable-only cpusets or some other dubious usage.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH] mm/oom: detect and kill task which has allocation forbidden by cpuset limit
  2021-09-01 14:05         ` Michal Hocko
@ 2021-09-02  7:34           ` Feng Tang
  0 siblings, 0 replies; 7+ messages in thread
From: Feng Tang @ 2021-09-02  7:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, linux-mm, Andrew Morton, Christian Brauner, linux-kernel

On Wed, Sep 01, 2021 at 04:05:40PM +0200, Michal Hocko wrote:
[SNIP]
> 
> This looks better than the previous attempt. It would be still better to
> solve this at the page allocator layer. The slowpath is already doing
> this for the nodemask. E.g.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index eeb3a9cb36bb..a3193134540d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4929,6 +4929,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> +	/*
> +	 * Check for insane configurations where the cpuset doesn't contain any suitable
> +	 * zone to satisfy the request - e.g. kernel allocations from MOVABLE nodes only
> +	 */
> +	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
> +		struct zoneref *z = first_zones_zonelist(ac->zonelist, ac->highest_zoneidx,
> +				&cpuset_current_mems_allowed);
> +		if (!z->zone)
> +			goto nopage;
> +	}
> +
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);

Thanks for the suggestion! It does bail out early, skipping kswapd,
direct reclaim and compaction.

I also looked at prepare_alloc_pages(), which does some cpuset checks
and zone initialization, but I'd better leave it alone as it's a real
hot path, while this check is in the slowpath anyway.

Will run some page fault benchmark cases with this patch.

Thanks,
Feng

> If this is seen as additional overhead just to cope with an insane
> configuration, then we can add an insane_cpusets_enabled() helper,
> which would be a static branch enabled when somebody actually tries to
> configure movable-only cpusets or some other dubious usage.
> -- 
> Michal Hocko
> SUSE Labs

