* [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
@ 2020-11-04  6:10 Feng Tang
  2020-11-04  6:10 ` [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes Feng Tang
                   ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Feng Tang @ 2020-11-04  6:10 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

Hi,

This patchset tries to report a problem and to get suggestions and
review for the RFC fix patches.

We recently got an OOM report: when a user tries to bind a docker
(container) instance to a memory node which only has movable zones,
OOM killing still can't solve the page allocation failure.

The callstack was:

	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
	[ 1387.877570] Call Trace:
	[ 1387.877579]  dump_stack+0x6b/0x88
	[ 1387.877584]  dump_header+0x4a/0x1e2
	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
	[ 1387.877591]  out_of_memory+0x3d/0x80
	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
	[ 1387.877602]  pipe_write+0x322/0x590
	[ 1387.877607]  new_sync_write+0x196/0x1b0
	[ 1387.877609]  vfs_write+0x1c3/0x1f0
	[ 1387.877611]  ksys_write+0xa7/0xe0
	[ 1387.877617]  do_syscall_64+0x52/0xd0
	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The meminfo log only shows the movable-only node, which has plenty
of free memory. And in our reproduction with patch 1/2, the normal
node (which has DMA/DMA32/Normal zones) also has a lot of free memory
when the OOM happens.

If we hack to make this (GFP_HIGHUSER|__GFP_ACCOUNT) request get
a page, a following full docker run (like installing and running the
'stress-ng' stress test) will see more allocation failures due to
different kinds of requests (gfp_masks). Patch 2/2 detects such cases,
where the allowed target nodes only have movable zones, and loosens
the binding check; otherwise an OOM is triggered even though the OOM
killing won't help, as the problem is not a lack of free memory.

Feng Tang (2):
  mm, oom: dump meminfo for all memory nodes
  mm, page_alloc: loosen the node binding check to avoid helpless oom
    killing

 mm/oom_kill.c   |  2 +-
 mm/page_alloc.c | 22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 1 deletion(-)

-- 
2.7.4



* [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes
  2020-11-04  6:10 [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Feng Tang
@ 2020-11-04  6:10 ` Feng Tang
  2020-11-04  7:18   ` Michal Hocko
  2020-11-04  6:10 ` [RFC PATCH 2/2] mm, page_alloc: loosen the node binding check to avoid helpless oom killing Feng Tang
  2020-11-04  7:13 ` [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Michal Hocko
  2 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-04  6:10 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

In some OOM cases, if there is memory node binding (current->mems_allowed
is not NULL), the system may only print the meminfo for these bound nodes,
while other nodes' info could still be important for debugging.

For example, on a platform with one normal node (which has DMA/DMA32/NORMAL...
zones) and one node which only has a movable zone (either a memory hotplug
case or a persistent memory node), some users will run docker while binding
memory to the movable node. Many memory allocations originating from the
docker instance will fall back to the other node, and when an OOM happens,
the meminfo for both nodes is needed.

So extend show_mem() to cover all memory nodes.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/oom_kill.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8b84661..601476cc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -462,7 +462,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 	if (is_memcg_oom(oc))
 		mem_cgroup_print_oom_meminfo(oc->memcg);
 	else {
-		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
+		show_mem(SHOW_MEM_FILTER_NODES, &node_states[N_MEMORY]);
 		if (is_dump_unreclaim_slabs())
 			dump_unreclaimable_slab();
 	}
-- 
2.7.4



* [RFC PATCH 2/2] mm, page_alloc: loosen the node binding check to avoid helpless oom killing
  2020-11-04  6:10 [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Feng Tang
  2020-11-04  6:10 ` [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes Feng Tang
@ 2020-11-04  6:10 ` Feng Tang
  2020-11-04  7:23   ` Michal Hocko
  2020-11-04  7:13 ` [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Michal Hocko
  2 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-04  6:10 UTC (permalink / raw)
  To: Andrew Morton, Michal Hocko, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel
  Cc: Feng Tang

With the arrival of the memory hotplug feature and persistent memory,
on some platforms there are memory nodes which only have a movable zone.

Users may bind some of their workloads (like docker/containers) to
these nodes, and there are many reports of OOMs and page allocation
failures; one callstack is:

	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
	[ 1387.877570] Call Trace:
	[ 1387.877579]  dump_stack+0x6b/0x88
	[ 1387.877584]  dump_header+0x4a/0x1e2
	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
	[ 1387.877591]  out_of_memory+0x3d/0x80
	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
	[ 1387.877602]  pipe_write+0x322/0x590
	[ 1387.877607]  new_sync_write+0x196/0x1b0
	[ 1387.877609]  vfs_write+0x1c3/0x1f0
	[ 1387.877611]  ksys_write+0xa7/0xe0
	[ 1387.877617]  do_syscall_64+0x52/0xd0
	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

In a full container run, like installing and running the stress tool
"stress-ng", there are many different kinds of page requests (gfp_masks),
many of which only allow non-movable zones. Some of them can fall back
to other nodes with NORMAL/DMA32/DMA zones, but others are blocked by
the __GFP_HARDWALL or ALLOC_CPUSET check and cause OOM killing. But
OOM killing won't help here, as this is not an issue of lack of
free memory; the allocation is simply blocked by the node binding
policy check.

So loosen the policy check for this case.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/page_alloc.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d772206..efd49a9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4669,6 +4669,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * If the task's target memory nodes only have movable zones, while the
+	 * gfp_mask allowed zone is lower than ZONE_MOVABLE, loosen the check
+	 * for __GFP_HARDWALL and ALLOC_CPUSET; otherwise it could trigger OOM
+	 * killing, which still cannot get past this policy check.
+	 */
+	if (ac->highest_zoneidx <= ZONE_NORMAL) {
+		int nid;
+		unsigned long unmovable = 0;
+
+		/* FIXME: this could be a separate function */
+		for_each_node_mask(nid, cpuset_current_mems_allowed) {
+			unmovable += NODE_DATA(nid)->node_present_pages -
+				NODE_DATA(nid)->node_zones[ZONE_MOVABLE].present_pages;
+		}
+
+		if (!unmovable) {
+			gfp_mask &= ~(__GFP_HARDWALL);
+			alloc_flags &= ~ALLOC_CPUSET;
+		}
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
 
-- 
2.7.4



* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-04  6:10 [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Feng Tang
  2020-11-04  6:10 ` [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes Feng Tang
  2020-11-04  6:10 ` [RFC PATCH 2/2] mm, page_alloc: loosen the node binding check to avoid helpless oom killing Feng Tang
@ 2020-11-04  7:13 ` Michal Hocko
  2020-11-04  7:38   ` Feng Tang
  2 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-11-04  7:13 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Wed 04-11-20 14:10:08, Feng Tang wrote:
> Hi,
> 
> This patchset tries to report a problem and get suggestion/review
> for the RFC fix patches.
> 
> We recently got an OOM report: when a user tries to bind a docker
> (container) instance to a memory node which only has movable zones,
> OOM killing still can't solve the page allocation failure.

This is a cpuset node binding right?

> The callstack was:
> 
> 	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
> 	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
> 	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
> 	[ 1387.877570] Call Trace:
> 	[ 1387.877579]  dump_stack+0x6b/0x88
> 	[ 1387.877584]  dump_header+0x4a/0x1e2
> 	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
> 	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
> 	[ 1387.877591]  out_of_memory+0x3d/0x80
> 	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
> 	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
> 	[ 1387.877602]  pipe_write+0x322/0x590
> 	[ 1387.877607]  new_sync_write+0x196/0x1b0
> 	[ 1387.877609]  vfs_write+0x1c3/0x1f0
> 	[ 1387.877611]  ksys_write+0xa7/0xe0
> 	[ 1387.877617]  do_syscall_64+0x52/0xd0
> 	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The meminfo log only shows the movable-only node, which has plenty
> of free memory. And in our reproduction with patch 1/2, the normal
> node (which has DMA/DMA32/Normal zones) also has a lot of free memory
> when the OOM happens.

OK, so you are binding to a movable node only and your above request is
for GFP_HIGHUSER which _cannot_ be satisfied from the movable zones
because that memory is not movable. So the system behaves as expected.
Your cpuset is misconfigured IMHO. Movable-only nodes come with their
risk and configuration price.
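
For reference, the relevant gfp definitions (include/linux/gfp.h,
lightly trimmed) make this explicit:

	/*
	 * GFP_USER adds __GFP_HARDWALL, i.e. strict cpuset confinement.
	 * Only GFP_HIGHUSER_MOVABLE carries __GFP_MOVABLE, so only it
	 * can be served from ZONE_MOVABLE (see gfp_zone()).
	 */
	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
	#define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
	#define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
	#define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)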

> If we hack to make this (GFP_HIGHUSER|__GFP_ACCOUNT) request get
> a page, a following full docker run (like installing and running the
> 'stress-ng' stress test) will see more allocation failures due to
> different kinds of requests (gfp_masks). Patch 2/2 detects such cases,
> where the allowed target nodes only have movable zones, and loosens
> the binding check; otherwise an OOM is triggered even though the OOM
> killing won't help, as the problem is not a lack of free memory.

Well, this breaks the cpuset containment, right? I consider this quite
unexpected for something that looks like a misconfiguration. I do agree
that this is unexpected for anybody who is not really familiar with the
concept of the movable zone, but we should probably call out all these
details rather than tweak the existing semantics.

Could you be more specific about the usecase here? Why do you need a
binding to a pure movable node? 
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes
  2020-11-04  6:10 ` [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes Feng Tang
@ 2020-11-04  7:18   ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-11-04  7:18 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Wed 04-11-20 14:10:09, Feng Tang wrote:
> In some OOM cases, if there is memory node binding (current->mems_allowed
> is not NULL), the system may only print the meminfo for these bound nodes,
> while other nodes' info could still be important for debugging.
> 
> For example, on a platform with one normal node (which has DMA/DMA32/NORMAL...
> zones) and one node which only has a movable zone (either a memory hotplug
> case or a persistent memory node), some users will run docker while binding
> memory to the movable node. Many memory allocations originating from the
> docker instance will fall back to the other node, and when an OOM happens,
> the meminfo for both nodes is needed.
> 
> So extend show_mem() to cover all memory nodes.

I do not like this change. The reason why we print only the relevant numa
nodes is the size of the oom report. Also, all the other numa nodes are not
really relevant to the allocation, so there is no real reason to print
their info. We used to do that in the past and decided that it was more
than suboptimal.

I do understand that this is preliminary work for your later patch,
which tweaks the node binding so that more numa nodes are eligible, but
then I would propose to merge the two patches.
 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/oom_kill.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8b84661..601476cc 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -462,7 +462,7 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
>  	if (is_memcg_oom(oc))
>  		mem_cgroup_print_oom_meminfo(oc->memcg);
>  	else {
> -		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
> +		show_mem(SHOW_MEM_FILTER_NODES, &node_states[N_MEMORY]);
>  		if (is_dump_unreclaim_slabs())
>  			dump_unreclaimable_slab();
>  	}
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 2/2] mm, page_alloc: loosen the node binding check to avoid helpless oom killing
  2020-11-04  6:10 ` [RFC PATCH 2/2] mm, page_alloc: loosen the node binding check to avoid helpless oom killing Feng Tang
@ 2020-11-04  7:23   ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-11-04  7:23 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Wed 04-11-20 14:10:10, Feng Tang wrote:
> With the arrival of the memory hotplug feature and persistent memory,
> on some platforms there are memory nodes which only have a movable zone.
> 
> Users may bind some of their workloads (like docker/containers) to
> these nodes, and there are many reports of OOMs and page allocation
> failures; one callstack is:
> 
> 	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
> 	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
> 	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
> 	[ 1387.877570] Call Trace:
> 	[ 1387.877579]  dump_stack+0x6b/0x88
> 	[ 1387.877584]  dump_header+0x4a/0x1e2
> 	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
> 	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
> 	[ 1387.877591]  out_of_memory+0x3d/0x80
> 	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
> 	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
> 	[ 1387.877602]  pipe_write+0x322/0x590
> 	[ 1387.877607]  new_sync_write+0x196/0x1b0
> 	[ 1387.877609]  vfs_write+0x1c3/0x1f0
> 	[ 1387.877611]  ksys_write+0xa7/0xe0
> 	[ 1387.877617]  do_syscall_64+0x52/0xd0
> 	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> In a full container run, like installing and running the stress tool
> "stress-ng", there are many different kinds of page requests (gfp_masks),
> many of which only allow non-movable zones. Some of them can fall back
> to other nodes with NORMAL/DMA32/DMA zones, but others are blocked by
> the __GFP_HARDWALL or ALLOC_CPUSET check and cause OOM killing. But
> OOM killing won't help here, as this is not an issue of lack of
> free memory; the allocation is simply blocked by the node binding
> policy check.
> 
> So loosen the policy check for this case.

This allows memory allocations to spill over to any other node which
has Normal (or other lower) zones, and as such it breaks cpuset isolation.
As I've pointed out in the reply to your cover letter, it seems that
this is more of a misconfiguration than a bug.

I do understand that killing any other task which can allocate from this
node is quite goofy, and that is something we can detect and better
target. E.g. fail the allocation or kill the allocating context when the
allocation request cannot be satisfied by any means. But breaking the
node isolation, which is a user contract, sounds like a bad workaround.
Binding to movable node(s) without any other fallback is simply
something you shouldn't do.

> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/page_alloc.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d772206..efd49a9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4669,6 +4669,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (!ac->preferred_zoneref->zone)
>  		goto nopage;
>  
> +	/*
> +	 * If the task's target memory nodes only have movable zones, while the
> +	 * gfp_mask allowed zone is lower than ZONE_MOVABLE, loosen the check
> +	 * for __GFP_HARDWALL and ALLOC_CPUSET; otherwise it could trigger OOM
> +	 * killing, which still cannot get past this policy check.
> +	 */
> +	if (ac->highest_zoneidx <= ZONE_NORMAL) {
> +		int nid;
> +		unsigned long unmovable = 0;
> +
> +		/* FIXME: this could be a separate function */
> +		for_each_node_mask(nid, cpuset_current_mems_allowed) {
> +			unmovable += NODE_DATA(nid)->node_present_pages -
> +				NODE_DATA(nid)->node_zones[ZONE_MOVABLE].present_pages;
> +		}
> +
> +		if (!unmovable) {
> +			gfp_mask &= ~(__GFP_HARDWALL);
> +			alloc_flags &= ~ALLOC_CPUSET;
> +		}
> +	}
> +
>  	if (alloc_flags & ALLOC_KSWAPD)
>  		wake_all_kswapds(order, gfp_mask, ac);
>  
> -- 
> 2.7.4
> 

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-04  7:13 ` [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Michal Hocko
@ 2020-11-04  7:38   ` Feng Tang
  2020-11-04  7:58     ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-04  7:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

Hi Michal,

Thanks for the prompt review!

On Wed, Nov 04, 2020 at 08:13:08AM +0100, Michal Hocko wrote:
> On Wed 04-11-20 14:10:08, Feng Tang wrote:
> > Hi,
> > 
> > This patchset tries to report a problem and get suggestion/review
> > for the RFC fix patches.
> > 
> > We recently got an OOM report: when a user tries to bind a docker
> > (container) instance to a memory node which only has movable zones,
> > OOM killing still can't solve the page allocation failure.
> 
> This is a cpuset node binding right?

Yes.

A simple test command is 'docker run -it --cpuset-mems 3 ubuntu:latest',
where node 3 is a movable-only PMEM node.
 
> > The callstack was:
> > 
> > 	[ 1387.877565] runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
> > 	[ 1387.877568] CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
> > 	[ 1387.877569] Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
> > 	[ 1387.877570] Call Trace:
> > 	[ 1387.877579]  dump_stack+0x6b/0x88
> > 	[ 1387.877584]  dump_header+0x4a/0x1e2
> > 	[ 1387.877586]  oom_kill_process.cold+0xb/0x10
> > 	[ 1387.877588]  out_of_memory.part.0+0xaf/0x230
> > 	[ 1387.877591]  out_of_memory+0x3d/0x80
> > 	[ 1387.877595]  __alloc_pages_slowpath.constprop.0+0x954/0xa20
> > 	[ 1387.877599]  __alloc_pages_nodemask+0x2d3/0x300
> > 	[ 1387.877602]  pipe_write+0x322/0x590
> > 	[ 1387.877607]  new_sync_write+0x196/0x1b0
> > 	[ 1387.877609]  vfs_write+0x1c3/0x1f0
> > 	[ 1387.877611]  ksys_write+0xa7/0xe0
> > 	[ 1387.877617]  do_syscall_64+0x52/0xd0
> > 	[ 1387.877621]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > The meminfo log only shows the movable-only node, which has plenty
> > of free memory. And in our reproduction with patch 1/2, the normal
> > node (which has DMA/DMA32/Normal zones) also has a lot of free memory
> > when the OOM happens.
> 
> OK, so you are binding to a movable node only and your above request is
> for GFP_HIGHUSER which _cannot_ be satisfied from the movable zones
> because that memory is not movable. So the system behaves as expected.
> Your cpuset is misconfigured IMHO. Movable-only nodes come with their
> risk and configuration price.

Aha, this is what we told the reporter at first. Their platform is a 2S
platform, where each socket has one DRAM node plus one persistent memory
node, and we suggested binding the docker to one DRAM + one PMEM node.

> > If we hack to make this (GFP_HIGHUSER|__GFP_ACCOUNT) request get
> > a page, a following full docker run (like installing and running the
> > 'stress-ng' stress test) will see more allocation failures due to
> > different kinds of requests (gfp_masks). Patch 2/2 detects such cases,
> > where the allowed target nodes only have movable zones, and loosens
> > the binding check; otherwise an OOM is triggered even though the OOM
> > killing won't help, as the problem is not a lack of free memory.
> 
> Well, this breaks the cpuset containment, right? I consider this quite
> unexpected for something that looks like a misconfiguration. I do agree
> that this is unexpected for anybody who is not really familiar with the
> concept of the movable zone, but we should probably call out all these
> details rather than tweak the existing semantics.

Yes, it does break the cpuset containment.

> Could you be more specific about the usecase here? Why do you need a
> binding to a pure movable node? 

One common configuration for a platform is a small size of DRAM plus a huge
size of PMEM (which is slower but cheaper), and my guess of their use
is to try to lead the bulk of user space allocations (GFP_HIGHUSER_MOVABLE)
to the PMEM node, and let DRAM be used as little as possible.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-04  7:38   ` Feng Tang
@ 2020-11-04  7:58     ` Michal Hocko
  2020-11-04  8:40       ` Feng Tang
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-11-04  7:58 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Wed 04-11-20 15:38:26, Feng Tang wrote:
[...]
> > Could you be more specific about the usecase here? Why do you need a
> > binding to a pure movable node? 
> 
> > One common configuration for a platform is a small size of DRAM plus a huge
> > size of PMEM (which is slower but cheaper), and my guess of their use
> > is to try to lead the bulk of user space allocations (GFP_HIGHUSER_MOVABLE)
> > to the PMEM node, and let DRAM be used as little as possible.

While this is possible, it is a tricky configuration. It essentially
gets us back to 32b and highmem...

As I've said in the reply to your second patch, I think we can make the oom
killer behavior more sensible in these misconfigured cases, but I do not
think we want to break the cpuset isolation for such a configuration.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-04  7:58     ` Michal Hocko
@ 2020-11-04  8:40       ` Feng Tang
  2020-11-04  8:53         ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-04  8:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote:
> On Wed 04-11-20 15:38:26, Feng Tang wrote:
> [...]
> > > Could you be more specific about the usecase here? Why do you need a
> > > binding to a pure movable node? 
> > 
> > One common configuration for a platform is a small size of DRAM plus a huge
> > size of PMEM (which is slower but cheaper), and my guess of their use
> > is to try to lead the bulk of user space allocations (GFP_HIGHUSER_MOVABLE)
> > to the PMEM node, and let DRAM be used as little as possible.
> 
> While this is possible, it is a tricky configuration. It essentially
> gets us back to 32b and highmem...

:) Another possible case is similar binding on a memory-hotpluggable
platform, which has one unpluggable node and several other nodes configured
as movable-only, so that they can be hot-removed when needed.

> As I've said in the reply to your second patch, I think we can make the oom
> killer behavior more sensible in these misconfigured cases, but I do not
> think we want to break the cpuset isolation for such a configuration.

Do you mean we skip the killing and just let the allocation fail? We've
checked the oom killer code first: when the oom happens, both the DRAM
node and the movable node have lots of free memory, and killing a process
won't improve the situation.

(The following is copied from your comments on 2/2)
> This allows memory allocations to spill over to any other node which
> has Normal (or other lower) zones, and as such it breaks cpuset isolation.
> As I've pointed out in the reply to your cover letter, it seems that
> this is more of a misconfiguration than a bug.

For the usage case (docker container running), the spilling is already
happening. I traced its memory allocation requests: many of them are
movable, and they fall back to the normal node naturally with the current
code; only a few get blocked, as many of the __alloc_pages_nodemask calls
are made with a 'NULL' nodemask parameter.

And I made this RFC patch inspired by code in __alloc_pages_may_oom():

	if (gfp_mask & __GFP_NOFAIL)
		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
				ALLOC_NO_WATERMARKS, ac);
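
For completeness, that helper first tries to stay within the cpuset and
only then drops the restriction; from mm/page_alloc.c it reads roughly
like this (as of v5.9):

	static inline struct page *
	__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
				      unsigned int alloc_flags,
				      const struct alloc_context *ac)
	{
		struct page *page;

		/* first try to satisfy the request within the cpuset */
		page = get_page_from_freelist(gfp_mask, order,
				alloc_flags|ALLOC_CPUSET, ac);
		/*
		 * fallback to ignore cpuset restriction if our nodes
		 * are depleted
		 */
		if (!page)
			page = get_page_from_freelist(gfp_mask, order,
					alloc_flags, ac);

		return page;
	}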

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-04  8:40       ` Feng Tang
@ 2020-11-04  8:53         ` Michal Hocko
  2020-11-05  1:40           ` Feng Tang
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-11-04  8:53 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Wed 04-11-20 16:40:21, Feng Tang wrote:
> On Wed, Nov 04, 2020 at 08:58:19AM +0100, Michal Hocko wrote:
> > On Wed 04-11-20 15:38:26, Feng Tang wrote:
> > [...]
> > > > Could you be more specific about the usecase here? Why do you need a
> > > > binding to a pure movable node? 
> > > 
> > > One common configuration for a platform is a small size of DRAM plus a huge
> > > size of PMEM (which is slower but cheaper), and my guess of their use
> > > is to try to lead the bulk of user space allocations (GFP_HIGHUSER_MOVABLE)
> > > to the PMEM node, and let DRAM be used as little as possible.
> > 
> > While this is possible, it is a tricky configuration. It essentially
> > gets us back to 32b and highmem...
> 
> :) Another possible case is similar binding on a memory-hotpluggable
> platform, which has one unpluggable node and several other nodes configured
> as movable-only, so that they can be hot-removed when needed.

Yes, another way to shoot yourself in the foot ;)

> > As I've said in the reply to your second patch, I think we can make the oom
> > killer behavior more sensible in these misconfigured cases, but I do not
> > think we want to break the cpuset isolation for such a configuration.
> 
> Do you mean we skip the killing and just let the allocation fail? We've
> checked the oom killer code first: when the oom happens, both the DRAM
> node and the movable node have lots of free memory, and killing a process
> won't improve the situation.

We already skip the oom killer and fail lowmem allocation requests.
This is similar in some sense. Another option would be to kill the
allocating context, which will potentially have fewer corner cases,
because some allocation failures might be unexpected.
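
For the lowmem case I mean the bail-out in __alloc_pages_may_oom(),
which looks roughly like this (mm/page_alloc.c):

	/* The OOM killer will not help higher order allocs */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		goto out;

	/* The OOM killer does not needlessly kill tasks for lowmem */
	if (ac->highest_zoneidx < ZONE_NORMAL)
		goto out;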

> (The following is copied from your comments on 2/2)
> > This allows memory allocations to spill over to any other node which
> > has Normal (or other lower) zones, and as such it breaks cpuset isolation.
> > As I've pointed out in the reply to your cover letter, it seems that
> > this is more of a misconfiguration than a bug.
> 
> For the usage case (docker container running), the spilling is already
> happening. I traced its memory allocation requests: many of them are
> movable, and they fall back to the normal node naturally with the current

Could you be more specific? This sounds like a bug. Allocations
shouldn't spill over to a node which is not in the cpuset. There are a few
exceptions like IRQ context, but that shouldn't happen regularly.

> code; only a few get blocked, as many of the __alloc_pages_nodemask calls
> are made with a 'NULL' nodemask parameter.
> 
> And I made this RFC patch inspired by code in __alloc_pages_may_oom():
> 
> 	if (gfp_mask & __GFP_NOFAIL)
> 		page = __alloc_pages_cpuset_fallback(gfp_mask, order,
> 				ALLOC_NO_WATERMARKS, ac);

I am not really sure I follow here. __GFP_NOFAIL is a special beast
because such an allocation must not fail. Breaking node affinity is the
only option left. This shouldn't be something used for regular
allocation requests.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-04  8:53         ` Michal Hocko
@ 2020-11-05  1:40           ` Feng Tang
  2020-11-05 12:08             ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-05  1:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
 
> > > As I've said in the reply to your second patch, I think we can make the oom
> > > killer behavior more sensible in these misconfigured cases, but I do not
> > > think we want to break the cpuset isolation for such a configuration.
> > 
> > Do you mean we skip the killing and just let the allocation fail? We've
> > checked the oom killer code first: when the oom happens, both the DRAM
> > node and the movable node have lots of free memory, and killing a process
> > won't improve the situation.
> 
> We already skip the oom killer and fail lowmem allocation requests.
> This is similar in some sense. Another option would be to kill the
> allocating context, which will potentially have fewer corner cases,
> because some allocation failures might be unexpected.

Yes, this can avoid the helpless oom killing of a good process (which is
under no memory pressure at all).

And I think the important thing is to judge whether this usage (binding
a docker-like workload to an unmovable node) is a valid case :)

Initially, I thought it invalid too, but later I came to think it still
makes some sense for these 2 cases:
    * a user wants to bind his workload to one node (most of the user
      space memory) to avoid cross-node traffic, and that node happens
      to be configured as unmovable
    * one small DRAM node + a big PMEM node, where a memory latency
      insensitive workload could be bound to the cheaper unmovable PMEM
      node
 
> > (The following is copied from your comments on 2/2)
> > > This allows memory allocations to spill over to any other node which
> > > has Normal (or other lower) zones, and as such it breaks cpuset isolation.
> > > As I've pointed out in the reply to your cover letter, it seems that
> > > this is more of a misconfiguration than a bug.
> > 
> > For the usage case (docker container running), the spilling is already
> > happening. I traced its memory allocation requests: many of them are
> > movable, and they fall back to the normal node naturally with the current
> 
> Could you be more specific? This sounds like a bug. Allocations
> shouldn't spill over to a node which is not in the cpuset. There are a few
> exceptions like IRQ context, but that shouldn't happen regularly.

I mean when the docker starts, it will spawn many processes which obey
the mem binding set, and they have some kernel page requests, which got
successfully allocated, like the following callstack:

	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
	[  567.044958] Call Trace:
	[  567.044972]  dump_stack+0x74/0x9a
	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
	[  567.044986]  alloc_pages_current+0x87/0xe0
	[  567.044991]  allocate_slab+0x2e5/0x4f0
	[  567.044996]  ___slab_alloc+0x380/0x5d0
	[  567.045021]  __slab_alloc+0x20/0x40
	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
	[  567.045041]  alloc_inode+0x22/0xa0
	[  567.045045]  new_inode_pseudo+0x12/0x60
	[  567.045049]  new_inode+0x17/0x30
	[  567.045052]  mqueue_get_inode+0x45/0x3b0
	[  567.045060]  mqueue_fill_super+0x41/0x70
	[  567.045067]  vfs_get_super+0x7f/0x100
	[  567.045074]  get_tree_keyed+0x1d/0x20
	[  567.045080]  mqueue_get_tree+0x1c/0x20
	[  567.045086]  vfs_get_tree+0x2a/0xc0
	[  567.045092]  fc_mount+0x13/0x50
	[  567.045099]  mq_create_mount+0x92/0xe0
	[  567.045102]  mq_init_ns+0x3b/0x50
	[  567.045106]  copy_ipcs+0x10a/0x1b0
	[  567.045113]  create_new_namespaces+0xa6/0x2b0
	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
	[  567.045124]  ksys_unshare+0x19f/0x360
	[  567.045129]  __x64_sys_unshare+0x12/0x20
	[  567.045135]  do_syscall_64+0x38/0x90
	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

For it, __alloc_pages_nodemask() will first try the process's target
nodemask (the movable-only node here), and there is no available zone, so it
goes on with the NULL nodemask, and gets a page in the slowpath.

And the same happens for the user space allocations as well, but those
get blocked by the CPUSET node binding check.
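
I think the difference comes from __GFP_HARDWALL: GFP_KERNEL requests
don't carry it, so the cpuset check lets them climb up to the nearest
hardwall ancestor, while user space allocations use GFP_USER/GFP_HIGHUSER
which do include __GFP_HARDWALL. My paraphrase of __cpuset_node_allowed()
(kernel/cgroup/cpuset.c; locking and the OOM-victim special cases omitted):

	bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
	{
		struct cpuset *cs;

		if (in_interrupt())
			return true;
		if (node_isset(node, current->mems_allowed))
			return true;
		if (gfp_mask & __GFP_HARDWALL)	/* hardwall request: stop here */
			return false;
		if (current->flags & PF_EXITING) /* let dying tasks get memory */
			return true;

		/*
		 * Otherwise walk up to the nearest mem_exclusive/mem_hardwall
		 * ancestor and test against its mems_allowed.
		 */
		cs = nearest_hardwall_ancestor(task_cs(current));
		return node_isset(node, cs->mems_allowed);
	}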

Thanks,
Feng




* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05  1:40           ` Feng Tang
@ 2020-11-05 12:08             ` Michal Hocko
  2020-11-05 12:53               ` Vlastimil Babka
  2020-11-06  4:32               ` Huang, Ying
  0 siblings, 2 replies; 27+ messages in thread
From: Michal Hocko @ 2020-11-05 12:08 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu 05-11-20 09:40:28, Feng Tang wrote:
> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
>  
> > > > As I've said in the reply to your second patch, I think we can make the oom
> > > > killer behavior more sensible in these misconfigured cases, but I do not
> > > > think we want to break the cpuset isolation for such a configuration.
> > > 
> > > Do you mean we skip the killing and just let the allocation fail? We've
> > > checked the oom killer code first: when the oom happens, both the DRAM
> > > node and the movable node have lots of free memory, and killing a process
> > > won't improve the situation.
> > 
> > We already skip the oom killer and fail lowmem allocation requests.
> > This is similar in some sense. Another option would be to kill the
> > allocating context, which will potentially have fewer corner cases,
> > because some allocation failures might be unexpected.
> 
> Yes, this can avoid the helpless oom killing of a good process (which is
> under no memory pressure at all).
> 
> And I think the important thing is to judge whether this usage (binding
> a docker-like workload to an unmovable node) is a valid case :)

I am confused. Why would an unmovable node be a problem? Movable
allocations can be satisfied from the Normal zone just fine. It is the
other way around that is a problem.

> Initially, I thought it invalid too, but later I came to think it still
> makes some sense for these 2 cases:
>     * a user wants to bind his workload to one node (most of the user
>       space memory) to avoid cross-node traffic, and that node happens
>       to be configured as unmovable

See above

>     * one small DRAM node + a big PMEM node, where a memory latency
>       insensitive workload could be bound to the cheaper unmovable PMEM
>       node

Please elaborate some more. As long as you have movable and normal nodes
then this should be possible with a good deal of care - most notably the
movable:kernel memory ratio shouldn't be too big.

Besides that, why does the PMEM node have to be MOVABLE only in the first
place?

> > > (The following is copied from your comments on 2/2)
> > > > This allows memory allocations to spill over to any other node which
> > > > has Normal (or other lower) zones, and as such it breaks cpuset isolation.
> > > > As I've pointed out in the reply to your cover letter, it seems that
> > > > this is more of a misconfiguration than a bug.
> > > 
> > > For the usage case (docker container running), the spilling is already
> > > happening. I traced its memory allocation requests: many of them are
> > > movable, and they fall back to the normal node naturally with the current
> > 
> > Could you be more specific? This sounds like a bug. Allocations
> > shouldn't spill over to a node which is not in the cpuset. There are a few
> > exceptions like IRQ context, but that shouldn't happen regularly.
> 
> I mean when the docker starts, it will spawn many processes which obey
> the mem binding set, and they have some kernel page requests, which got
> successfully allocated, like the following callstack:
> 
> 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
> 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
> 	[  567.044958] Call Trace:
> 	[  567.044972]  dump_stack+0x74/0x9a
> 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
> 	[  567.044986]  alloc_pages_current+0x87/0xe0
> 	[  567.044991]  allocate_slab+0x2e5/0x4f0
> 	[  567.044996]  ___slab_alloc+0x380/0x5d0
> 	[  567.045021]  __slab_alloc+0x20/0x40
> 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
> 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
> 	[  567.045041]  alloc_inode+0x22/0xa0
> 	[  567.045045]  new_inode_pseudo+0x12/0x60
> 	[  567.045049]  new_inode+0x17/0x30
> 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
> 	[  567.045060]  mqueue_fill_super+0x41/0x70
> 	[  567.045067]  vfs_get_super+0x7f/0x100
> 	[  567.045074]  get_tree_keyed+0x1d/0x20
> 	[  567.045080]  mqueue_get_tree+0x1c/0x20
> 	[  567.045086]  vfs_get_tree+0x2a/0xc0
> 	[  567.045092]  fc_mount+0x13/0x50
> 	[  567.045099]  mq_create_mount+0x92/0xe0
> 	[  567.045102]  mq_init_ns+0x3b/0x50
> 	[  567.045106]  copy_ipcs+0x10a/0x1b0
> 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
> 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
> 	[  567.045124]  ksys_unshare+0x19f/0x360
> 	[  567.045129]  __x64_sys_unshare+0x12/0x20
> 	[  567.045135]  do_syscall_64+0x38/0x90
> 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> For it, __alloc_pages_nodemask() will first try the process's target
> nodemask (the movable-only node here), and there is no available zone, so it
> goes on with the NULL nodemask, and gets a page in the slowpath.

OK, I see your point now. I was not aware of the slab allocator not
following cpusets. Sounds like a bug to me.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 12:08             ` Michal Hocko
@ 2020-11-05 12:53               ` Vlastimil Babka
  2020-11-05 12:58                 ` Michal Hocko
  2020-11-06  4:32               ` Huang, Ying
  1 sibling, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2020-11-05 12:53 UTC (permalink / raw)
  To: Michal Hocko, Feng Tang
  Cc: Andrew Morton, Johannes Weiner, Matthew Wilcox, Mel Gorman,
	dave.hansen, ying.huang, linux-mm, linux-kernel

On 11/5/20 1:08 PM, Michal Hocko wrote:
> On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> > 
>> > Could you be more specific? This sounds like a bug. Allocations
>> > shouldn't spill over to a node which is not in the cpuset. There are a few
>> > exceptions like IRQ context, but that shouldn't happen regularly.
>> 
>> I mean when the docker starts, it will spawn many processes which obey
>> the mem binding set, and they have some kernel page requests, which got
>> successfully allocated, like the following callstack:
>> 
>> 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
>> 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
>> 	[  567.044958] Call Trace:
>> 	[  567.044972]  dump_stack+0x74/0x9a
>> 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
>> 	[  567.044986]  alloc_pages_current+0x87/0xe0
>> 	[  567.044991]  allocate_slab+0x2e5/0x4f0
>> 	[  567.044996]  ___slab_alloc+0x380/0x5d0
>> 	[  567.045021]  __slab_alloc+0x20/0x40
>> 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
>> 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
>> 	[  567.045041]  alloc_inode+0x22/0xa0
>> 	[  567.045045]  new_inode_pseudo+0x12/0x60
>> 	[  567.045049]  new_inode+0x17/0x30
>> 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
>> 	[  567.045060]  mqueue_fill_super+0x41/0x70
>> 	[  567.045067]  vfs_get_super+0x7f/0x100
>> 	[  567.045074]  get_tree_keyed+0x1d/0x20
>> 	[  567.045080]  mqueue_get_tree+0x1c/0x20
>> 	[  567.045086]  vfs_get_tree+0x2a/0xc0
>> 	[  567.045092]  fc_mount+0x13/0x50
>> 	[  567.045099]  mq_create_mount+0x92/0xe0
>> 	[  567.045102]  mq_init_ns+0x3b/0x50
>> 	[  567.045106]  copy_ipcs+0x10a/0x1b0
>> 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
>> 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
>> 	[  567.045124]  ksys_unshare+0x19f/0x360
>> 	[  567.045129]  __x64_sys_unshare+0x12/0x20
>> 	[  567.045135]  do_syscall_64+0x38/0x90
>> 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> For it, __alloc_pages_nodemask() will first try the process's target
>> nodemask (the movable-only node here), and there is no available zone, so it
>> goes on with the NULL nodemask, and gets a page in the slowpath.
> 
> OK, I see your point now. I was not aware of the slab allocator not
> following cpusets. Sounds like a bug to me.

SLAB and SLUB seem to not care about cpusets in the fast path. But this stack 
shows that it went all the way to the page allocator, so the cpusets should have 
been obeyed there at least.



* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 12:53               ` Vlastimil Babka
@ 2020-11-05 12:58                 ` Michal Hocko
  2020-11-05 13:07                   ` Feng Tang
  2020-11-05 13:14                   ` Vlastimil Babka
  0 siblings, 2 replies; 27+ messages in thread
From: Michal Hocko @ 2020-11-05 12:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Feng Tang, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
> On 11/5/20 1:08 PM, Michal Hocko wrote:
> > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> > > > > > Could you be more specific? This sounds like a bug. Allocations
> > > > > shouldn't spill over to a node which is not in the cpuset. There are a few
> > > > > exceptions like IRQ context, but that shouldn't happen regularly.
> > > 
> > > I mean when the docker starts, it will spawn many processes which obey
> > > the mem binding set, and they have some kernel page requests, which got
> > > successfully allocated, like the following callstack:
> > > 
> > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
> > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
> > > 	[  567.044958] Call Trace:
> > > 	[  567.044972]  dump_stack+0x74/0x9a
> > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
> > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
> > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
> > > 	[  567.045021]  __slab_alloc+0x20/0x40
> > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
> > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
> > > 	[  567.045041]  alloc_inode+0x22/0xa0
> > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
> > > 	[  567.045049]  new_inode+0x17/0x30
> > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
> > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
> > > 	[  567.045067]  vfs_get_super+0x7f/0x100
> > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
> > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
> > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
> > > 	[  567.045092]  fc_mount+0x13/0x50
> > > 	[  567.045099]  mq_create_mount+0x92/0xe0
> > > 	[  567.045102]  mq_init_ns+0x3b/0x50
> > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
> > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
> > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
> > > 	[  567.045124]  ksys_unshare+0x19f/0x360
> > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
> > > 	[  567.045135]  do_syscall_64+0x38/0x90
> > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > 
> > > For it, __alloc_pages_nodemask() will first try the process's target
> > > nodemask (the movable-only node here), and there is no available zone, so it
> > > goes on with the NULL nodemask, and gets a page in the slowpath.
> > 
> > OK, I see your point now. I was not aware of the slab allocator not
> > following cpusets. Sounds like a bug to me.
> 
> SLAB and SLUB seem to not care about cpusets in the fast path.

Is a fallback to a different node which is outside of the cpuset
possible?

> But this
> stack shows that it went all the way to the page allocator, so the cpusets
> should have been obeyed there at least.

Looking closer, what is this dump_stack actually saying?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 12:58                 ` Michal Hocko
@ 2020-11-05 13:07                   ` Feng Tang
  2020-11-05 13:12                     ` Michal Hocko
  2020-11-05 13:14                   ` Vlastimil Babka
  1 sibling, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-05 13:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu, Nov 05, 2020 at 01:58:28PM +0100, Michal Hocko wrote:
> On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
> > On 11/5/20 1:08 PM, Michal Hocko wrote:
> > > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> > > > > > Could you be more specific? This sounds like a bug. Allocations
> > > > > shouldn't spill over to a node which is not in the cpuset. There are a few
> > > > > exceptions like IRQ context, but that shouldn't happen regularly.
> > > > 
> > > > I mean when the docker starts, it will spawn many processes which obey
> > > > the mem binding set, and they have some kernel page requests, which got
> > > > successfully allocated, like the following callstack:
> > > > 
> > > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
> > > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
> > > > 	[  567.044958] Call Trace:
> > > > 	[  567.044972]  dump_stack+0x74/0x9a
> > > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
> > > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
> > > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
> > > > 	[  567.045021]  __slab_alloc+0x20/0x40
> > > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
> > > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
> > > > 	[  567.045041]  alloc_inode+0x22/0xa0
> > > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
> > > > 	[  567.045049]  new_inode+0x17/0x30
> > > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
> > > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
> > > > 	[  567.045067]  vfs_get_super+0x7f/0x100
> > > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
> > > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
> > > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
> > > > 	[  567.045092]  fc_mount+0x13/0x50
> > > > 	[  567.045099]  mq_create_mount+0x92/0xe0
> > > > 	[  567.045102]  mq_init_ns+0x3b/0x50
> > > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
> > > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
> > > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
> > > > 	[  567.045124]  ksys_unshare+0x19f/0x360
> > > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
> > > > 	[  567.045135]  do_syscall_64+0x38/0x90
> > > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > 
> > > > For it, __alloc_pages_nodemask() will first try the process's target
> > > > nodemask (the movable-only node here), and there is no available zone, so it
> > > > goes on with the NULL nodemask, and gets a page in the slowpath.
> > > 
> > > OK, I see your point now. I was not aware of the slab allocator not
> > > following cpusets. Sounds like a bug to me.
> > 
> > SLAB and SLUB seem to not care about cpusets in the fast path.
> 
> Is a fallback to a different node which is outside of the cpuset
> possible?
 
My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'.

And actually in this usage, I've seen other types of kernel allocation
requests fall back to the normal node which is not in the cpuset mem
nodemask, like:

	[  567.510901] CPU: 3 PID: 2022 Comm: runc:[2:INIT] Tainted: G        W I       5.9.0-rc8+ #6
	[  567.510902] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
	[  567.510903] Call Trace:
	[  567.510909]  dump_stack+0x74/0x9a
	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
	[  567.510913]  alloc_pages_current+0x87/0xe0
	[  567.510914]  __vmalloc_node_range+0x14c/0x240
	[  567.510918]  module_alloc+0x82/0xe0
	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
	[  567.510925]  bpf_int_jit_compile+0x145/0x424
	[  567.510926]  bpf_prog_select_runtime+0xac/0x130
	[  567.510928]  bpf_prepare_filter+0x44c/0x4b0
	[  567.510932]  bpf_prog_create_from_user+0xc7/0x120
	[  567.510934]  do_seccomp+0x118/0x990
	[  567.510937]  __x64_sys_seccomp+0x1a/0x20
	[  567.510939]  do_syscall_64+0x38/0x90

And its gfp_mask is (GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN)

Thanks,
Feng

> > But this
> > stack shows that it went all the way to the page allocator, so the cpusets
> > should have been obeyed there at least.
> 
> Looking closer, what is this dump_stack actually saying?
> 
> -- 
> Michal Hocko
> SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 13:07                   ` Feng Tang
@ 2020-11-05 13:12                     ` Michal Hocko
  2020-11-05 13:43                       ` Feng Tang
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-11-05 13:12 UTC (permalink / raw)
  To: Feng Tang
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu 05-11-20 21:07:10, Feng Tang wrote:
[...]
> My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'.

Can you provide the full information, please? Which node has been
requested? Which cpuset does the calling process run in, and which node
did the allocation succeed from? A bare dump_stack without any further
context is not really helpful.

-- 
Michal Hocko
SUSE Labs


* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 12:58                 ` Michal Hocko
  2020-11-05 13:07                   ` Feng Tang
@ 2020-11-05 13:14                   ` Vlastimil Babka
  2020-11-05 13:19                     ` Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2020-11-05 13:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Feng Tang, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On 11/5/20 1:58 PM, Michal Hocko wrote:
> On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
>> On 11/5/20 1:08 PM, Michal Hocko wrote:
>> > On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> > > > > Could you be more specific? This sounds like a bug. Allocations
>> > > > shouldn't spill over to a node which is not in the cpuset. There are a few
>> > > > exceptions like IRQ context, but that shouldn't happen regularly.
>> > > 
>> > > I mean when the docker starts, it will spawn many processes which obey
>> > > the mem binding set, and they have some kernel page requests, which got
>> > > successfully allocated, like the following callstack:
>> > > 
>> > > 	[  567.044953] CPU: 1 PID: 2021 Comm: runc:[1:CHILD] Tainted: G        W I       5.9.0-rc8+ #6
>> > > 	[  567.044956] Hardware name:  /NUC6i5SYB, BIOS SYSKLi35.86A.0051.2016.0804.1114 08/04/2016
>> > > 	[  567.044958] Call Trace:
>> > > 	[  567.044972]  dump_stack+0x74/0x9a
>> > > 	[  567.044978]  __alloc_pages_nodemask.cold+0x22/0xe5
>> > > 	[  567.044986]  alloc_pages_current+0x87/0xe0
>> > > 	[  567.044991]  allocate_slab+0x2e5/0x4f0
>> > > 	[  567.044996]  ___slab_alloc+0x380/0x5d0
>> > > 	[  567.045021]  __slab_alloc+0x20/0x40
>> > > 	[  567.045025]  kmem_cache_alloc+0x2a0/0x2e0
>> > > 	[  567.045033]  mqueue_alloc_inode+0x1a/0x30
>> > > 	[  567.045041]  alloc_inode+0x22/0xa0
>> > > 	[  567.045045]  new_inode_pseudo+0x12/0x60
>> > > 	[  567.045049]  new_inode+0x17/0x30
>> > > 	[  567.045052]  mqueue_get_inode+0x45/0x3b0
>> > > 	[  567.045060]  mqueue_fill_super+0x41/0x70
>> > > 	[  567.045067]  vfs_get_super+0x7f/0x100
>> > > 	[  567.045074]  get_tree_keyed+0x1d/0x20
>> > > 	[  567.045080]  mqueue_get_tree+0x1c/0x20
>> > > 	[  567.045086]  vfs_get_tree+0x2a/0xc0
>> > > 	[  567.045092]  fc_mount+0x13/0x50
>> > > 	[  567.045099]  mq_create_mount+0x92/0xe0
>> > > 	[  567.045102]  mq_init_ns+0x3b/0x50
>> > > 	[  567.045106]  copy_ipcs+0x10a/0x1b0
>> > > 	[  567.045113]  create_new_namespaces+0xa6/0x2b0
>> > > 	[  567.045118]  unshare_nsproxy_namespaces+0x5a/0xb0
>> > > 	[  567.045124]  ksys_unshare+0x19f/0x360
>> > > 	[  567.045129]  __x64_sys_unshare+0x12/0x20
>> > > 	[  567.045135]  do_syscall_64+0x38/0x90
>> > > 	[  567.045143]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> > > 
>> > > For it, __alloc_pages_nodemask() will first try the process's target
>> > > nodemask (the movable-only node here), and there is no available zone, so it
>> > > goes on with the NULL nodemask, and gets a page in the slowpath.
>> > 
>> > OK, I see your point now. I was not aware of the slab allocator not
>> > following cpusets. Sounds like a bug to me.
>> 
>> SLAB and SLUB seem to not care about cpusets in the fast path.
> 
> Is a fallback to a different node which is outside of the cpuset
> possible?

AFAICS anything in the per-cpu cache will be allocated without looking at the
cpuset, so it can be outside of the cpuset. In the SLUB slowpath, get_partial_node()
looking for a fallback on the same node will also not look at the cpuset.
get_any_partial(), looking for a fallback allocation on any node, does check
cpuset_zone_allowed() and obeys it strictly. A fallback to the page allocator
will obey whatever the page allocator obeys.

So if a process is restricted so that it cannot allocate from node X via cpuset 
*and* also cannot be executed on CPUs from node X via taskset, then AFAICS it 
effectively cannot violate the cpuset in SLUB, because it won't reach the percpu 
or per-node caches that don't check cpusets.
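
For reference, that single cpuset check sits on the cross-node fallback path
only. A simplified sketch of get_any_partial() in mm/slub.c (paraphrased from
the code around v5.9, not the exact source):

	for_each_zone_zonelist(zone, z, zonelist, highest_zoneidx) {
		struct kmem_cache_node *n = get_node(s, zone_to_nid(zone));

		/* cpusets are consulted only here, on the remote-node fallback */
		if (n && cpuset_zone_allowed(zone, flags) &&
		    n->nr_partial > s->min_partial) {
			object = get_partial_node(s, n, c, flags);
			if (object)
				return object;
		}
	}

The percpu cache and the same-node get_partial_node() call above it go through
no such check.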

>> But this
>> stack shows that it went all the way to the page allocator, so the cpusets
>> should have been obeyed there at least.
> 
> Looking closer what is this dump_stack saying actually?

Yes, is that a dump of successful allocation (that violates cpusets?) or a 
failing one?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 13:14                   ` Vlastimil Babka
@ 2020-11-05 13:19                     ` Michal Hocko
  2020-11-05 13:34                       ` Vlastimil Babka
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-11-05 13:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Feng Tang, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu 05-11-20 14:14:25, Vlastimil Babka wrote:
> On 11/5/20 1:58 PM, Michal Hocko wrote:
> > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
> > > On 11/5/20 1:08 PM, Michal Hocko wrote:
> > > > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> > > > > > Could you be more specific? This sounds like a bug. Allocations
> > > > > > shouldn't spill over to a node which is not in the cpuset. There are few
> > > > > > exceptions like IRQ context but that shouldn't happen regularly.
> > > > > 
> > > > > I mean when the docker starts, it will spawn many processes which obey
> > > > > the mem binding set, and they have some kernel page requests, which got
> > > > > successfully allocated, like the following callstack:
> > > > > 
> > > > > [...]
> > > > > 
> > > > > For it, the __alloc_pages_nodemask() will first try the process's target
> > > > > nodemask (unmovable node here), and there is no available zone, so it
> > > > > goes with the NULL nodemask, and gets a page in the slowpath.
> > > > 
> > > > OK, I see your point now. I was not aware of the slab allocator not
> > > > following cpusets. Sounds like a bug to me.
> > > 
> > > SLAB and SLUB seem to not care about cpusets in the fast path.
> > 
> > Is a fallback to a different node which is outside of the cpuset
> > possible?
> 
> AFAICS anything in per-cpu cache will be allocated without looking at the
> cpuset, so it can be outside of the cpuset. In SLUB slowpath,
> get_partial_node() looking for fallback on the same node will also not look
> at cpuset. get_any_partial() looking for a fallback allocation on any node
> does check cpuset_zone_allowed() and obey it strictly. A fallback to page
> allocator will obey whatever page allocator obeys.

IIUC this means that if there is no strong CPU binding to cpuset nodes
then a runaway is possible. Albeit only partially and relying on
somebody to fill up pcp object caches, right?

Is that an oversight, a design decision, or a performance optimization?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 13:19                     ` Michal Hocko
@ 2020-11-05 13:34                       ` Vlastimil Babka
  0 siblings, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2020-11-05 13:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Feng Tang, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel,
	Christoph Lameter

On 11/5/20 2:19 PM, Michal Hocko wrote:
> On Thu 05-11-20 14:14:25, Vlastimil Babka wrote:
>> On 11/5/20 1:58 PM, Michal Hocko wrote:
>> > On Thu 05-11-20 13:53:24, Vlastimil Babka wrote:
>> > > On 11/5/20 1:08 PM, Michal Hocko wrote:
>> > > > On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> > > > > > Could you be more specific? This sounds like a bug. Allocations
>> > > > > > shouldn't spill over to a node which is not in the cpuset. There are few
>> > > > > > exceptions like IRQ context but that shouldn't happen regularly.
>> > > > > 
>> > > > > I mean when the docker starts, it will spawn many processes which obey
>> > > > > the mem binding set, and they have some kernel page requests, which got
>> > > > > successfully allocated, like the following callstack:
>> > > > > 
>> > > > > [...]
>> > > > > 
>> > > > > For it, the __alloc_pages_nodemask() will first try the process's target
>> > > > > nodemask (unmovable node here), and there is no available zone, so it
>> > > > > goes with the NULL nodemask, and gets a page in the slowpath.
>> > > > 
>> > > > OK, I see your point now. I was not aware of the slab allocator not
>> > > > following cpusets. Sounds like a bug to me.
>> > > 
>> > > SLAB and SLUB seem to not care about cpusets in the fast path.
>> > 
>> > Is a fallback to a different node which is outside of the cpuset
>> > possible?
>> 
>> AFAICS anything in per-cpu cache will be allocated without looking at the
>> cpuset, so it can be outside of the cpuset. In SLUB slowpath,
>> get_partial_node() looking for fallback on the same node will also not look
>> at cpuset. get_any_partial() looking for a fallback allocation on any node
>> does check cpuset_zone_allowed() and obey it strictly. A fallback to page
>> allocator will obey whatever page allocator obeys.
> 
> IIUC this means that if there is no strong CPU binding to cpuset nodes
> then a runaway is possible. Albeit only partially and relying on
> somebody to fill up pcp object caches, right?

Seems so.

> Is that an oversight, a design decision, or a performance optimization?
  ... yes :)

More seriously, let's ask Christoph, as that code goes all the way back to the
first SLUB commit.
On the SLAB side, it would seem percpu caches came first, then cpuset support
was added for the page allocator, but SLAB was oblivious until it learned some
of it in commit 765c4507af71c to properly support __GFP_THISNODE.

My guess is that the amount of cpuset constraint violation through percpu caches 
was never considered to be a problem serious enough to justify making the 
fastpaths slower.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 13:12                     ` Michal Hocko
@ 2020-11-05 13:43                       ` Feng Tang
  2020-11-05 16:16                         ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-05 13:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> On Thu 05-11-20 21:07:10, Feng Tang wrote:
> [...]
> > My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'
> 
> Can you provide the full information please? Which node has been
> requested? Which cpuset does the calling process run in, and which node has
> the allocation succeeded from? A bare dump_stack without any further
> context is not really helpful.

I don't have the same platform as the original report, so I simulated
a similar setup (with fakenuma and movablecore), which has 2 memory
nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only a
Movable zone. With it, I can get the same error and the same oom callstack
as the original report (as in the cover-letter).

The test command is:
	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
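
In case someone wants to reproduce it: as the setup uses fakenuma plus
movablecore, the boot parameters were along these lines, with the size
illustrative and chosen so that all of node 1 falls into ZONE_MOVABLE:

	numa=fake=2 movablecore=<size>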

To debug, I only added some tracing in __alloc_pages_nodemask(), and
here is a callstack which gets the page successfully:

	[  567.510903] Call Trace:
	[  567.510909]  dump_stack+0x74/0x9a
	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
	[  567.510913]  alloc_pages_current+0x87/0xe0
	[  567.510914]  __vmalloc_node_range+0x14c/0x240
	[  567.510918]  module_alloc+0x82/0xe0
	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
	[  567.510925]  bpf_int_jit_compile+0x145/0x424
	[  567.510926]  bpf_prog_select_runtime+0xac/0x130

The incoming parameter nodemask is NULL, so the function will first try the
cpuset nodemask (node 1 here); as the request is only granted zoneidx up to 2,
the 'ac's preferred zone ends up NULL and it goes into __alloc_pages_slowpath(),
which will first reset the nodemask to NULL, and this time it gets a preferred
zone: zone DMA32 from node 0, from which the following get_page_from_freelist()
will allocate one page.
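
That corresponds to this part of the allocator, a trimmed sketch of
mm/page_alloc.c around v5.9 (not the complete code):

	/* prepare_alloc_pages(): the fast path substitutes the cpuset mask */
	if (cpusets_enabled()) {
		*alloc_mask |= __GFP_HARDWALL;
		if (!ac->nodemask)
			ac->nodemask = &cpuset_current_mems_allowed;
		else
			*alloc_flags |= ALLOC_CPUSET;
	}

	/*
	 * __alloc_pages_nodemask(), once the fast path has failed:
	 * restore the original nodemask if it was potentially replaced with
	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
	 */
	ac.nodemask = nodemask;		/* NULL again in our case */
	page = __alloc_pages_slowpath(alloc_mask, order, &ac);

The slowpath then recomputes ac->preferred_zoneref with the NULL nodemask,
which is how it ends up on node 0's DMA32 zone.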

Thanks,
Feng

> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 13:43                       ` Feng Tang
@ 2020-11-05 16:16                         ` Michal Hocko
  2020-11-06  7:06                           ` Feng Tang
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-11-05 16:16 UTC (permalink / raw)
  To: Feng Tang
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu 05-11-20 21:43:05, Feng Tang wrote:
> On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > [...]
> > > My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'
> > 
> > Can you provide the full information please? Which node has been
> > requested? Which cpuset does the calling process run in, and which node has
> > the allocation succeeded from? A bare dump_stack without any further
> > context is not really helpful.
> 
> I don't have the same platform as the original report, so I simulated
> a similar setup (with fakenuma and movablecore), which has 2 memory
> nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only a
> Movable zone. With it, I can get the same error and the same oom callstack
> as the original report (as in the cover-letter).
> 
> The test command is:
> 	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> 
> To debug, I only added some tracing in __alloc_pages_nodemask(), and
> here is a callstack which gets the page successfully:
> 
> 	[  567.510903] Call Trace:
> 	[  567.510909]  dump_stack+0x74/0x9a
> 	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> 	[  567.510913]  alloc_pages_current+0x87/0xe0
> 	[  567.510914]  __vmalloc_node_range+0x14c/0x240
> 	[  567.510918]  module_alloc+0x82/0xe0
> 	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> 	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> 	[  567.510925]  bpf_int_jit_compile+0x145/0x424
> 	[  567.510926]  bpf_prog_select_runtime+0xac/0x130

As already said this doesn't really tell much without the additional
information.

> The incoming parameter nodemask is NULL, so the function will first try the
> cpuset nodemask (node 1 here); as the request is only granted zoneidx up to 2,
> the 'ac's preferred zone ends up NULL and it goes into __alloc_pages_slowpath(),
> which will first reset the nodemask to NULL, and this time it gets a preferred
> zone: zone DMA32 from node 0, from which the following get_page_from_freelist()
> will allocate one page.

I do not follow. Both hot and slow paths of the allocator set
ALLOC_CPUSET or emulate it by mems_allowed when cpusets are enabled,
IIRC. This is later enforced in get_page_from_freelist. There are some
exceptions when the allocating process can run away from its cpusets -
e.g. IRQs, OOM victims and a few other cases - but definitely not a random
allocation. There might be some subtle details that have changed, or I
might have forgotten some, but...
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 12:08             ` Michal Hocko
  2020-11-05 12:53               ` Vlastimil Babka
@ 2020-11-06  4:32               ` Huang, Ying
  2020-11-06  7:43                 ` Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: Huang, Ying @ 2020-11-06  4:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Feng Tang, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, linux-mm, linux-kernel

Michal Hocko <mhocko@suse.com> writes:

> On Thu 05-11-20 09:40:28, Feng Tang wrote:
>> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
>>  
>> > > > As I've said in reply to your second patch, I think we can make the oom
>> > > > killer behavior more sensible in these misconfigured cases, but I do not
>> > > > think we want to break the cpuset isolation for such a configuration.
>> > > 
>> > > Do you mean we skip the killing and just let the allocation fail? We've
>> > > checked the oom killer code first, when the oom happens, both DRAM
>> > > node and unmovable node have lots of free memory, and killing process
>> > > won't improve the situation.
>> > 
>> > We already skip the oom killer and fail for lowmem allocation requests.
>> > This is similar in some sense. Another option would be to kill the
>> > allocating context, which will potentially have fewer corner cases because
>> > some allocation failures might be unexpected.
>> 
>> Yes, this can avoid the helpless oom killing of a good process (under no
>> memory pressure at all)
>> 
>> And I think the important thing is to judge whether this usage (binding
>> docker like workload to unmovable node) is a valid case :) 
>
> I am confused. Why would an unmovable node be a problem? Movable
> allocations can be satisfied from the Zone Normal just fine. It is the other
> way around that is a problem.
>
>> Initially, I thought it invalid too, but later think it still makes some
>> sense for the 2 cases:
>>     * user want to bind his workload to one node(most of user space
>>       memory) to avoid cross-node traffic, and that node happens to
>>       be configured as unmovable
>
> See above
>
>>     * one small DRAM node + big PMEM node, and memory latency insensitive
>>       workload could be bound to the cheaper unmovable PMEM node
>
> Please elaborate some more. As long as you have movable and normal nodes
> then this should be possible with a good deal of care - most notably the
> movable:kernel memory ratio shouldn't be too big.
>
> Besides that why does PMEM node have to be MOVABLE only in the first
> place?

The performance of PMEM is much worse than that of DRAM.  If we find
that some pages on PMEM are accessed frequently (hot), we may want to
move them to DRAM to optimize the system performance.  If unmovable
pages are allocated on PMEM and become hot, it's possible that we cannot
move them to DRAM without rebooting the system.  So we think we should
make the PMEM nodes MOVABLE only.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-05 16:16                         ` Michal Hocko
@ 2020-11-06  7:06                           ` Feng Tang
  2020-11-06  8:10                             ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-06  7:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote:
> On Thu 05-11-20 21:43:05, Feng Tang wrote:
> > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > > [...]
> > > > My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'
> > > 
> > > Can you provide the full information please? Which node has been
> > > requested? Which cpuset does the calling process run in, and which node has
> > > the allocation succeeded from? A bare dump_stack without any further
> > > context is not really helpful.
> > 
> > I don't have the same platform as the original report, so I simulated
> > a similar setup (with fakenuma and movablecore), which has 2 memory
> > nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only a
> > Movable zone. With it, I can get the same error and the same oom callstack
> > as the original report (as in the cover-letter).
> > 
> > The test command is:
> > 	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> > 
> > To debug, I only added some tracing in __alloc_pages_nodemask(), and
> > here is a callstack which gets the page successfully:
> > 
> > 	[  567.510903] Call Trace:
> > 	[  567.510909]  dump_stack+0x74/0x9a
> > 	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> > 	[  567.510913]  alloc_pages_current+0x87/0xe0
> > 	[  567.510914]  __vmalloc_node_range+0x14c/0x240
> > 	[  567.510918]  module_alloc+0x82/0xe0
> > 	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> > 	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> > 	[  567.510925]  bpf_int_jit_compile+0x145/0x424
> > 	[  567.510926]  bpf_prog_select_runtime+0xac/0x130
> 
> As already said this doesn't really tell much without the additional
> information.
> 
> > The incoming parameter nodemask is NULL, so the function will first try the
> > cpuset nodemask (node 1 here); as the request is only granted zoneidx up to 2,
> > the 'ac's preferred zone ends up NULL and it goes into __alloc_pages_slowpath(),
> > which will first reset the nodemask to NULL, and this time it gets a preferred
> > zone: zone DMA32 from node 0, from which the following get_page_from_freelist()
> > will allocate one page.
> 
> I do not follow. Both hot and slow paths of the allocator set
> ALLOC_CPUSET or emulate it by mems_allowed when cpusets are enabled,
> IIRC. This is later enforced in get_page_from_freelist. There are some
> exceptions when the allocating process can run away from its cpusets -
> e.g. IRQs, OOM victims and a few other cases - but definitely not a random
> allocation. There might be some subtle details that have changed, or I
> might have forgotten some, but...

Yes, I was confused too. IIUC, the key check inside get_page_from_freelist()
is:

	if (cpusets_enabled() &&
		(alloc_flags & ALLOC_CPUSET) &&
		!__cpuset_zone_allowed(zone, gfp_mask))

In our case (a kernel page got allocated), the first 2 conditions are true,
and for __cpuset_zone_allowed(), the place where it can still return true is
the check of the parent cpuset's nodemask:

	cs = nearest_hardwall_ancestor(task_cs(current));
	allowed = node_isset(node, cs->mems_allowed);

This is what lets the allocation pass the ALLOC_CPUSET check.
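
For context, the whole test is roughly the following, a condensed sketch of
__cpuset_node_allowed() (which __cpuset_zone_allowed() wraps), with locking
and declarations omitted:

	if (in_interrupt())
		return true;
	if (node_isset(node, current->mems_allowed))
		return true;
	if (unlikely(tsk_is_oom_victim(current)))
		return true;
	if (gfp_mask & __GFP_HARDWALL)		/* hardwall requests stop here */
		return false;
	if (current->flags & PF_EXITING)	/* let a dying task have memory */
		return true;

	/* !__GFP_HARDWALL and node not in mems_allowed: scan up the cpusets */
	cs = nearest_hardwall_ancestor(task_cs(current));
	return node_isset(node, cs->mems_allowed);

So any !__GFP_HARDWALL request (GFP_KERNEL included) can fall through to the
hardwall-ancestor check.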

Thanks,
Feng
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-06  4:32               ` Huang, Ying
@ 2020-11-06  7:43                 ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-11-06  7:43 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Feng Tang, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, linux-mm, linux-kernel

On Fri 06-11-20 12:32:44, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 05-11-20 09:40:28, Feng Tang wrote:
> >> On Wed, Nov 04, 2020 at 09:53:43AM +0100, Michal Hocko wrote:
> >>  
> >> > > > As I've said in reply to your second patch, I think we can make the oom
> >> > > > killer behavior more sensible in these misconfigured cases, but I do not
> >> > > > think we want to break the cpuset isolation for such a configuration.
> >> > > 
> >> > > Do you mean we skip the killing and just let the allocation fail? We've
> >> > > checked the oom killer code first, when the oom happens, both DRAM
> >> > > node and unmovable node have lots of free memory, and killing process
> >> > > won't improve the situation.
> >> > 
> >> > We already skip the oom killer and fail for lowmem allocation requests.
> >> > This is similar in some sense. Another option would be to kill the
> >> > allocating context, which will potentially have fewer corner cases because
> >> > some allocation failures might be unexpected.
> >> 
> >> Yes, this can avoid the helpless oom killing of a good process (under no
> >> memory pressure at all)
> >> 
> >> And I think the important thing is to judge whether this usage (binding
> >> docker like workload to unmovable node) is a valid case :) 
> >
> > I am confused. Why would an unmovable node be a problem? Movable
> > allocations can be satisfied from the Zone Normal just fine. It is the other
> > way around that is a problem.
> >
> >> Initially, I thought it invalid too, but later think it still makes some
> >> sense for the 2 cases:
> >>     * user want to bind his workload to one node(most of user space
> >>       memory) to avoid cross-node traffic, and that node happens to
> >>       be configured as unmovable
> >
> > See above
> >
> >>     * one small DRAM node + big PMEM node, and memory latency insensitive
> >>       workload could be bound to the cheaper unmovable PMEM node
> >
> > Please elaborate some more. As long as you have movable and normal nodes
> > then this should be possible with a good deal of care - most notably the
> > movable:kernel memory ratio shouldn't be too big.
> >
> > Besides that why does PMEM node have to be MOVABLE only in the first
> > place?
> 
> The performance of PMEM is much worse than that of DRAM.  If we find
> that some pages on PMEM are accessed frequently (hot), we may want to
> move them to DRAM to optimize the system performance.  If unmovable
> pages are allocated on PMEM and become hot, it's possible that we cannot
> move them to DRAM without rebooting the system.  So we think we should
> make the PMEM nodes MOVABLE only.

That is fair, but then you really need a fallback node too. So this is a
mere optimization rather than a fundamental restriction.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-06  7:06                           ` Feng Tang
@ 2020-11-06  8:10                             ` Michal Hocko
  2020-11-06  9:08                               ` Feng Tang
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2020-11-06  8:10 UTC (permalink / raw)
  To: Feng Tang
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri 06-11-20 15:06:56, Feng Tang wrote:
> On Thu, Nov 05, 2020 at 05:16:12PM +0100, Michal Hocko wrote:
> > On Thu 05-11-20 21:43:05, Feng Tang wrote:
> > > On Thu, Nov 05, 2020 at 02:12:45PM +0100, Michal Hocko wrote:
> > > > On Thu 05-11-20 21:07:10, Feng Tang wrote:
> > > > [...]
> > > > > My debug traces show it is, and its gfp_mask is 'GFP_KERNEL'
> > > > 
> > > > Can you provide the full information please? Which node has been
> > > > requested? Which cpuset does the calling process run in, and which node has
> > > > the allocation succeeded from? A bare dump_stack without any further
> > > > context is not really helpful.
> > > 
> > > I don't have the same platform as the original report, so I simulated
> > > a similar setup (with fakenuma and movablecore), which has 2 memory
> > > nodes: node 0 has DMA/DMA32/Movable zones, while node 1 has only a
> > > Movable zone. With it, I can get the same error and the same oom callstack
> > > as the original report (as in the cover-letter).
> > > 
> > > The test command is:
> > > 	# docker run -it --rm --cpuset-mems 1 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
> > > 
> > > To debug, I only added some tracing in __alloc_pages_nodemask(), and
> > > here is a callstack which gets the page successfully:
> > > 
> > > 	[  567.510903] Call Trace:
> > > 	[  567.510909]  dump_stack+0x74/0x9a
> > > 	[  567.510910]  __alloc_pages_nodemask.cold+0x22/0xe5
> > > 	[  567.510913]  alloc_pages_current+0x87/0xe0
> > > 	[  567.510914]  __vmalloc_node_range+0x14c/0x240
> > > 	[  567.510918]  module_alloc+0x82/0xe0
> > > 	[  567.510921]  bpf_jit_alloc_exec+0xe/0x10
> > > 	[  567.510922]  bpf_jit_binary_alloc+0x7a/0x120
> > > 	[  567.510925]  bpf_int_jit_compile+0x145/0x424
> > > 	[  567.510926]  bpf_prog_select_runtime+0xac/0x130
> > 
> > As already said this doesn't really tell much without the additional
> > information.
> > 
> > > The incoming parameter nodemask is NULL, so the function will first try the
> > > cpuset nodemask (node 1 here); as the request is only granted zoneidx up to 2,
> > > the 'ac's preferred zone ends up NULL and it goes into __alloc_pages_slowpath(),
> > > which will first reset the nodemask to NULL, and this time it gets a preferred
> > > zone: zone DMA32 from node 0, from which the following get_page_from_freelist()
> > > will allocate one page.
> > 
> > I do not follow. Both hot and slow paths of the allocator set
> > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are enabled,
> > IIRC. This is later enforced in get_page_from_freelist. There are some
> > exceptions when the allocating process can run away from its cpusets -
> > e.g. IRQs, OOM victims and a few other cases - but definitely not a random
> > allocation. There might be some subtle details that have changed, or I
> > might have forgotten some, but...
> 
> Yes, I was confused too. IIUC, the key check inside get_page_from_freelist()
> is:
> 
> 	if (cpusets_enabled() &&
> 		(alloc_flags & ALLOC_CPUSET) &&
> 		!__cpuset_zone_allowed(zone, gfp_mask))
> 
> In our case (a kernel page got allocated), the first 2 conditions are true,
> and for __cpuset_zone_allowed(), the place where it can still return true is
> the check of the parent cpuset's nodemask:
> 
> 	cs = nearest_hardwall_ancestor(task_cs(current));
> 	allowed = node_isset(node, cs->mems_allowed);
> 
> This is what lets the allocation pass the ALLOC_CPUSET check.

Yes, and this is ok because that is the defined hierarchical semantic of
cpusets, which applies to any !hardwalled allocation. Cpusets are quite
non-intuitive. Re-reading the previous discussion I have realized that
my trying not to go into those details might have misled you. Let me
try again and clarify that now.

I was talking in the context of the patch you are proposing, and that is a
clear violation of the cpuset isolation - especially for hardwalled
setups, because it allows spilling over to other nodes, which shouldn't be
possible except for a few exceptions which shouldn't generate a lot of
allocations (e.g. an oom victim exiting, IRQ context).

What I was not talking about, and should have been clearer about, is
that without hardwall resp. exclusive nodes the isolation is best effort
only for most kernel allocation requests (or more specifically those
without __GFP_HARDWALL). Your patch doesn't distinguish between those
and other non-movable allocations, and effectively allows a runaway even
for hardwalled allocations which are not movable. Those can be controlled
by userspace very easily.
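
For reference, the difference is visible directly in the gfp flag definitions
(include/linux/gfp.h): the userspace-driven flavors carry __GFP_HARDWALL,
plain kernel ones do not:

	#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
	#define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
	#define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
	#define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)

So a GFP_HIGHUSER request (like the pipe_write one) is hardwalled, while the
GFP_KERNEL slab allocations are not.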

I hope this clarifies it a bit more and sorry if I mislead you.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-06  8:10                             ` Michal Hocko
@ 2020-11-06  9:08                               ` Feng Tang
  2020-11-06 10:35                                 ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Feng Tang @ 2020-11-06  9:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri, Nov 06, 2020 at 09:10:26AM +0100, Michal Hocko wrote:
> > > > The incoming parameter nodemask is NULL, so the function will first try the
> > > > cpuset nodemask (node 1 here); as the request is only granted zoneidx up to 2,
> > > > the 'ac's preferred zone ends up NULL and it goes into __alloc_pages_slowpath(),
> > > > which will first reset the nodemask to NULL, and this time it gets a preferred
> > > > zone: zone DMA32 from node 0, from which the following get_page_from_freelist()
> > > > will allocate one page.
> > > 
> > > I do not follow. Both hot and slow paths of the allocator set
> > > ALLOC_CPUSET or emulate it by mems_allowed when cpusets are enabled,
> > > IIRC. This is later enforced in get_page_from_freelist. There are some
> > > exceptions when the allocating process can run away from its cpusets -
> > > e.g. IRQs, OOM victims and a few other cases - but definitely not a random
> > > allocation. There might be some subtle details that have changed, or I
> > > might have forgotten some, but...
> > 
> > Yes, I was confused too. IIUC, the key check inside get_page_from_freelist()
> > is:
> > 
> > 	if (cpusets_enabled() &&
> > 		(alloc_flags & ALLOC_CPUSET) &&
> > 		!__cpuset_zone_allowed(zone, gfp_mask))
> > 
> > In our case (a kernel page got allocated), the first 2 conditions are true,
> > and for __cpuset_zone_allowed(), the place where it can still return true is
> > the check of the parent cpuset's nodemask:
> > 
> > 	cs = nearest_hardwall_ancestor(task_cs(current));
> > 	allowed = node_isset(node, cs->mems_allowed);
> > 
> > This is what lets the allocation pass the ALLOC_CPUSET check.
> 
> Yes, and this is ok because that is the defined hierarchical semantic of
> cpusets, which applies to any !hardwalled allocation. Cpusets are quite
> non-intuitive. Re-reading the previous discussion I have realized that
> my trying not to go into those details might have misled you. Let me
> try again and clarify that now.
> 
> I was talking in the context of the patch you are proposing, and that is a
> clear violation of the cpuset isolation - especially for hardwalled
> setups, because it allows spilling over to other nodes, which shouldn't be
> possible except for a few exceptions which shouldn't generate a lot of
> allocations (e.g. an oom victim exiting, IRQ context).

I agree my patch is pretty hacky. As said in the cover letter, I wanted to
bring up this usage case and get suggestions on how to support it.
 
> What I was not talking about, and should have been clearer about, is
> that without hardwall resp. exclusive nodes the isolation is best effort
> only for most kernel allocation requests (or more specifically those
> without __GFP_HARDWALL). Your patch doesn't distinguish between those
> and other non-movable allocations, and effectively allows a runaway even
> for hardwalled allocations which are not movable. Those can be controlled
> by userspace very easily.

You are right, there are quite a few types of page allocation failures.
The callstack in patch 2/2 is a GFP_HIGHUSER one from pipe_write, and there
are more types of kernel allocation requests which will get blocked by
different checks. My RFC patch just gave the easiest one-for-all hack to
let them bypass the binding check.

Do we need to tackle them case by case?

> I hope this clarifies it a bit more and sorry if I mislead you.

Yes, it does and many thanks for the clarifying!

- Feng

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node
  2020-11-06  9:08                               ` Feng Tang
@ 2020-11-06 10:35                                 ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2020-11-06 10:35 UTC (permalink / raw)
  To: Feng Tang
  Cc: Vlastimil Babka, Andrew Morton, Johannes Weiner, Matthew Wilcox,
	Mel Gorman, dave.hansen, ying.huang, linux-mm, linux-kernel

On Fri 06-11-20 17:08:57, Feng Tang wrote:
[...]
> You are right, there are quite a few types of page allocation failures.
> The callstack in patch 2/2 is a GFP_HIGHUSER one from pipe_write, and there
> are more types of kernel allocation requests which will get blocked by
> different checks. My RFC patch just gave the easiest one-for-all hack to
> let them bypass the binding check.
> 
> Do we need to tackle them case by case?

No, I do not see how we can change those __GFP_HARDWALL requests without
breaking the isolation.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2020-11-06 10:36 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-04  6:10 [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Feng Tang
2020-11-04  6:10 ` [RFC PATCH 1/2] mm, oom: dump meminfo for all memory nodes Feng Tang
2020-11-04  7:18   ` Michal Hocko
2020-11-04  6:10 ` [RFC PATCH 2/2] mm, page_alloc: loose the node binding check to avoid helpless oom killing Feng Tang
2020-11-04  7:23   ` Michal Hocko
2020-11-04  7:13 ` [RFC PATCH 0/2] mm: fix OOMs for binding workloads to movable zone only node Michal Hocko
2020-11-04  7:38   ` Feng Tang
2020-11-04  7:58     ` Michal Hocko
2020-11-04  8:40       ` Feng Tang
2020-11-04  8:53         ` Michal Hocko
2020-11-05  1:40           ` Feng Tang
2020-11-05 12:08             ` Michal Hocko
2020-11-05 12:53               ` Vlastimil Babka
2020-11-05 12:58                 ` Michal Hocko
2020-11-05 13:07                   ` Feng Tang
2020-11-05 13:12                     ` Michal Hocko
2020-11-05 13:43                       ` Feng Tang
2020-11-05 16:16                         ` Michal Hocko
2020-11-06  7:06                           ` Feng Tang
2020-11-06  8:10                             ` Michal Hocko
2020-11-06  9:08                               ` Feng Tang
2020-11-06 10:35                                 ` Michal Hocko
2020-11-05 13:14                   ` Vlastimil Babka
2020-11-05 13:19                     ` Michal Hocko
2020-11-05 13:34                       ` Vlastimil Babka
2020-11-06  4:32               ` Huang, Ying
2020-11-06  7:43                 ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).