* [RFC 0/4] fix premature OOM due to cpuset races
@ 2017-01-17 22:16 ` Vlastimil Babka
  0 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-17 22:16 UTC (permalink / raw)
  To: Mel Gorman, Ganapatrao Kulkarni
  Cc: Michal Hocko, linux-kernel, linux-mm, Vlastimil Babka

This is my attempt to fix the recent report based on the LTP cpuset stress
test [1]. The patches are based on 4.9, as that was the originally reported
version, but it was later reported that the problem exists since 4.7. We will
probably want to send this to stable, as triggering OOMs is not nice. That's
why the patches try not to be too intrusive.

Longer-term we might think about how to fix the cpuset mess in a better and
less error-prone way. I was, for example, very surprised to learn that cpuset
updates change not only task->mems_allowed, but also the nodemasks of
mempolicies. Until now I had expected the nodemask parameter of
alloc_pages_nodemask() to be stable. I also wonder why we then treat cpusets
specially in get_page_from_freelist() and distinguish HARDWALL etc., when
there's an unconditional intersection between mempolicy and cpuset. I would
expect the nodemask adjustment to exist to save overhead in g_p_f(), but that
clearly doesn't happen in the current form. So we have both crazy complexity
and overhead, AFAICS.
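
For reference, the retry scheme the allocator (and this series) relies on is
the usual seqcount read-side pattern around mems_allowed. Below is a
self-contained toy model of that pattern in plain userspace C. It is only an
illustration of the idiom, not the kernel code; the real helpers are
read_mems_allowed_begin() / read_mems_allowed_retry(), and the toy_* names
are made up for this sketch.

#include <stdatomic.h>
#include <stdio.h>

/* Toy stand-in for the mems_allowed seqcount and mask. */
static atomic_uint mems_seq;
static unsigned long mems_allowed = 0x1;	/* node 0 only */

static unsigned int toy_read_mems_allowed_begin(void)
{
	unsigned int seq;

	/* Wait out a writer in the middle of an update (odd sequence). */
	do {
		seq = atomic_load(&mems_seq);
	} while (seq & 1);

	return seq;
}

static int toy_read_mems_allowed_retry(unsigned int seq)
{
	/* The mask changed while we were allocating, so retry. */
	return atomic_load(&mems_seq) != seq;
}

int main(void)
{
	unsigned int cookie;
	int attempts = 0;

	do {
		cookie = toy_read_mems_allowed_begin();
		attempts++;
		/* ... one allocation attempt based on mems_allowed ... */
	} while (toy_read_mems_allowed_retry(cookie));

	printf("succeeded after %d attempt(s), mask %#lx\n",
	       attempts, mems_allowed);
	return 0;
}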

[1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com

Vlastimil Babka (4):
  mm, page_alloc: fix check for NULL preferred_zone
  mm, page_alloc: fix fast-path race with cpuset update or removal
  mm, page_alloc: move cpuset seqcount checking to slowpath
  mm, page_alloc: fix premature OOM when racing with cpuset mems update

 mm/page_alloc.c | 58 ++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 37 insertions(+), 21 deletions(-)

-- 
2.11.0

* [RFC 1/4] mm, page_alloc: fix check for NULL preferred_zone
  2017-01-17 22:16 ` Vlastimil Babka
@ 2017-01-17 22:16   ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-17 22:16 UTC (permalink / raw)
  To: Mel Gorman, Ganapatrao Kulkarni
  Cc: Michal Hocko, linux-kernel, linux-mm, Vlastimil Babka

Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in
a zonelist twice") we have had a wrong check for a NULL preferred_zone, which
can theoretically happen due to a concurrent cpuset modification. We check the
zoneref pointer, which is never NULL, when we should be checking the zone
pointer.
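
To illustrate: first_zones_zonelist() returns a pointer into the zonelist's
zoneref array, and when no eligible zone exists it returns the terminating
entry, whose ->zone is NULL; the returned pointer itself never is. A
self-contained toy model of that behaviour in userspace C (simplified
stand-ins, not the kernel structures):

#include <stdio.h>

struct toy_zone { int nid; };
struct toy_zoneref { struct toy_zone *zone; int zone_idx; };

/* Zonelist terminated by an entry with .zone == NULL, as in the kernel. */
static struct toy_zone zones[] = { { 0 }, { 1 } };
static struct toy_zoneref zonelist[] = {
	{ &zones[0], 0 }, { &zones[1], 0 }, { NULL, 0 },
};

/* Return the first zoneref allowed by the nodemask, or the terminator. */
static struct toy_zoneref *toy_first_zones_zonelist(unsigned long nodemask)
{
	struct toy_zoneref *z = zonelist;

	while (z->zone && !(nodemask & (1UL << z->zone->nid)))
		z++;

	return z;	/* never NULL itself, but z->zone may be */
}

int main(void)
{
	/* An empty nodemask models the racing cpuset update. */
	struct toy_zoneref *z = toy_first_zones_zonelist(0);

	/* Checking !z can never fire; checking !z->zone catches this. */
	printf("zoneref %p, zone %p\n", (void *)z, (void *)z->zone);
	return 0;
}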

Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34ada718ef47..593a11d8bc6b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3763,7 +3763,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 */
 	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
 					ac.high_zoneidx, ac.nodemask);
-	if (!ac.preferred_zoneref) {
+	if (!ac.preferred_zoneref->zone) {
 		page = NULL;
 		goto no_zone;
 	}
-- 
2.11.0

* [RFC 2/4] mm, page_alloc: fix fast-path race with cpuset update or removal
  2017-01-17 22:16 ` Vlastimil Babka
@ 2017-01-17 22:16   ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-17 22:16 UTC (permalink / raw)
  To: Mel Gorman, Ganapatrao Kulkarni
  Cc: Michal Hocko, linux-kernel, linux-mm, Vlastimil Babka

Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
triggers the OOM killer within a few seconds, despite lots of free memory.
The test repeatedly faults in memory in one process in a cpuset, while
another process keeps changing the cpuset's allowed nodes between 0 and 1.

One possible cause is that in the fast path we find the preferred zoneref
according to the current mems_allowed, so it can point into the middle of the
zonelist, skipping e.g. the zones of node 1 completely. If mems_allowed is
then updated to contain only node 1, we never reach those zones in the
zonelist, and trigger the OOM killer before checking the cpuset_mems_cookie.
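
A toy illustration of that failure mode in userspace C (simplified; in the
kernel the forward-only walk is the zonelist iterator in
get_page_from_freelist() starting at ac->preferred_zoneref):

#include <stdio.h>

/* Flattened toy zonelist: the node of each entry, terminated by -1. */
static const int zonelist_nid[] = { 1, 1, 0, 0, -1 };

/* Index of the first entry allowed by the mask (may be the terminator). */
static int first_allowed(unsigned long nodemask)
{
	int i = 0;

	while (zonelist_nid[i] >= 0 && !(nodemask & (1UL << zonelist_nid[i])))
		i++;

	return i;
}

int main(void)
{
	/* The fast path computed the starting point for mems_allowed = {0}, */
	int start = first_allowed(1UL << 0);
	/* ...and then the cpuset was switched to {1}. */
	unsigned long nodemask = 1UL << 1;
	int i, found = 0;

	/* The forward-only walk from the stale starting point misses node 1. */
	for (i = start; zonelist_nid[i] >= 0; i++)
		if (nodemask & (1UL << zonelist_nid[i]))
			found = 1;

	printf("eligible zone found: %s\n",
	       found ? "yes" : "no, premature OOM");
	return 0;
}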

This patch fixes this particular case by redoing the preferred zoneref search
when we switch back to the original nodemask. The condition is also slightly
changed so that we don't miss the restore when the last non-root cpuset is
removed.

Note that this is not a full fix, and more patches will follow.

Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 593a11d8bc6b..dedadb4a779f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3783,9 +3783,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	/*
 	 * Restore the original nodemask if it was potentially replaced with
 	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
+	 * Also recalculate the starting point for the zonelist iterator or
+	 * we could end up iterating over non-eligible zones endlessly.
 	 */
-	if (cpusets_enabled())
+	if (unlikely(ac.nodemask != nodemask)) {
 		ac.nodemask = nodemask;
+		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
+						ac.high_zoneidx, ac.nodemask);
+		if (!ac.preferred_zoneref->zone)
+			goto no_zone;
+	}
+
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
 
 no_zone:
-- 
2.11.0

* [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath
  2017-01-17 22:16 ` Vlastimil Babka
@ 2017-01-17 22:16   ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-17 22:16 UTC (permalink / raw)
  To: Mel Gorman, Ganapatrao Kulkarni
  Cc: Michal Hocko, linux-kernel, linux-mm, Vlastimil Babka

This is a preparation for the following patch, split out to make its review
simpler. While the primary motivation is a bug fix, this could also save some
cycles in the fast path.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dedadb4a779f..bbc3f015f796 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3502,12 +3502,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
-	enum compact_priority compact_priority = DEF_COMPACT_PRIORITY;
+	enum compact_priority compact_priority;
 	enum compact_result compact_result;
-	int compaction_retries = 0;
-	int no_progress_loops = 0;
+	int compaction_retries;
+	int no_progress_loops;
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
+	unsigned int cpuset_mems_cookie;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3528,6 +3529,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
+retry_cpuset:
+	compaction_retries = 0;
+	no_progress_loops = 0;
+	compact_priority = DEF_COMPACT_PRIORITY;
+	cpuset_mems_cookie = read_mems_allowed_begin();
+
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
 	 * kswapd needs to be woken up, and to avoid the cost of setting up
@@ -3699,6 +3706,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
+	/*
+	 * When updating a task's mems_allowed, it is possible to race with
+	 * parallel threads in such a way that an allocation can fail while
+	 * the mask is being updated. If a page allocation is about to fail,
+	 * check if the cpuset changed during allocation and if so, retry.
+	 */
+	if (read_mems_allowed_retry(cpuset_mems_cookie))
+		goto retry_cpuset;
+
 	warn_alloc(gfp_mask,
 			"page allocation failure: order:%u", order);
 got_pg:
@@ -3713,7 +3729,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	struct page *page;
-	unsigned int cpuset_mems_cookie;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
@@ -3750,9 +3765,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 
-retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
-
 	/* Dirty zone balancing only done in the fast path */
 	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
 
@@ -3765,6 +3777,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 					ac.high_zoneidx, ac.nodemask);
 	if (!ac.preferred_zoneref->zone) {
 		page = NULL;
+		/*
+		 * This might be due to race with cpuset_current_mems_allowed
+		 * update, so make sure we retry with original nodemask.
+		 */
 		goto no_zone;
 	}
 
@@ -3787,27 +3803,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * we could end up iterating over non-eligible zones endlessly.
 	 */
 	if (unlikely(ac.nodemask != nodemask)) {
+no_zone:
 		ac.nodemask = nodemask;
 		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
 						ac.high_zoneidx, ac.nodemask);
-		if (!ac.preferred_zoneref->zone)
-			goto no_zone;
+		/* If we have NULL preferred zone, slowpath wll handle that */
 	}
 
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
 
-no_zone:
-	/*
-	 * When updating a task's mems_allowed, it is possible to race with
-	 * parallel threads in such a way that an allocation can fail while
-	 * the mask is being updated. If a page allocation is about to fail,
-	 * check if the cpuset changed during allocation and if so, retry.
-	 */
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) {
-		alloc_mask = gfp_mask;
-		goto retry_cpuset;
-	}
-
 out:
 	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
 	    unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
-- 
2.11.0

* [RFC 4/4] mm, page_alloc: fix premature OOM when racing with cpuset mems update
  2017-01-17 22:16 ` Vlastimil Babka
@ 2017-01-17 22:16   ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-17 22:16 UTC (permalink / raw)
  To: Mel Gorman, Ganapatrao Kulkarni
  Cc: Michal Hocko, linux-kernel, linux-mm, Vlastimil Babka

Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode
triggers the OOM killer within a few seconds, despite lots of free memory.
The test repeatedly faults in memory in one process in a cpuset, while
another process keeps changing the cpuset's allowed nodes between 0 and 1.
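
For context, a minimal sketch of this kind of stress in userspace C. This is
only an illustration, not the LTP source; the cgroup mount point and cpuset
name below are assumptions, the faulting child is assumed to already be
attached to that cpuset, and both memory nodes must exist.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed legacy cpuset hierarchy path; adjust for the local setup. */
#define MEMS_FILE "/sys/fs/cgroup/cpuset/test/cpuset.mems"
#define SIZE (64UL << 20)

static void set_mems(const char *nodes)
{
	int fd = open(MEMS_FILE, O_WRONLY);

	if (fd < 0)
		return;
	write(fd, nodes, strlen(nodes));
	close(fd);
}

int main(void)
{
	if (fork() == 0) {
		/* Child: fault in and release memory in a loop. */
		for (;;) {
			char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				continue;
			for (unsigned long i = 0; i < SIZE; i += 4096)
				p[i] = 1;
			munmap(p, SIZE);
		}
	}

	/* Parent: keep flipping the cpuset's allowed memory nodes. */
	for (;;) {
		set_mems("0");
		set_mems("1");
	}
	return 0;
}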

The problem comes from insufficient protection against cpuset changes, which
can cause get_page_from_freelist() to consider all zones as non-eligible due
to the nodemask and/or current->mems_allowed. This was masked in the past by
a sufficient number of retries, but since commit 682a3385e773 ("mm,
page_alloc: inline the fast path of the zonelist iterator") we determine the
preferred_zoneref once, and don't iterate the whole zonelist in further
attempts.

A previous patch fixed this problem for current->mems_allowed. However,
cpuset changes also update the mempolicy nodemasks. The fix has two parts: we
have to repeat the preferred_zoneref search when we detect a cpuset update via
the seqcount, and we have to check the seqcount before considering OOM.

Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbc3f015f796..4db451270b08 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3534,6 +3534,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
+					ac->high_zoneidx, ac->nodemask);
+	if (!ac->preferred_zoneref->zone)
+		goto nopage;
+
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -3694,6 +3699,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				&compaction_retries))
 		goto retry;
 
+	if (read_mems_allowed_retry(cpuset_mems_cookie))
+		goto retry_cpuset;
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
@@ -3789,6 +3797,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (likely(page))
 		goto out;
 
+no_zone:
 	/*
 	 * Runtime PM, block IO and its error handling path can deadlock
 	 * because I/O on the device might not complete.
@@ -3802,13 +3811,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * Also recalculate the starting point for the zonelist iterator or
 	 * we could end up iterating over non-eligible zones endlessly.
 	 */
-	if (unlikely(ac.nodemask != nodemask)) {
-no_zone:
+	if (unlikely(ac.nodemask != nodemask))
 		ac.nodemask = nodemask;
-		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
-						ac.high_zoneidx, ac.nodemask);
-		/* If we have NULL preferred zone, slowpath wll handle that */
-	}
 
 	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
 
-- 
2.11.0

* Re: [RFC 4/4] mm, page_alloc: fix premature OOM when racing with cpuset mems update
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18  7:12     ` Hillf Danton
  -1 siblings, 0 replies; 46+ messages in thread
From: Hillf Danton @ 2017-01-18  7:12 UTC (permalink / raw)
  To: 'Vlastimil Babka', 'Mel Gorman',
	'Ganapatrao Kulkarni'
  Cc: 'Michal Hocko', linux-kernel, linux-mm


On Wednesday, January 18, 2017 6:16 AM Vlastimil Babka wrote: 
> 
> @@ -3802,13 +3811,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	 * Also recalculate the starting point for the zonelist iterator or
>  	 * we could end up iterating over non-eligible zones endlessly.
>  	 */
Is the newly added comment still needed?

> -	if (unlikely(ac.nodemask != nodemask)) {
> -no_zone:
> +	if (unlikely(ac.nodemask != nodemask))
>  		ac.nodemask = nodemask;
> -		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
> -						ac.high_zoneidx, ac.nodemask);
> -		/* If we have NULL preferred zone, slowpath wll handle that */
> -	}
> 
>  	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
> 
> --
> 2.11.0

* Re: [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18  7:22     ` Hillf Danton
  -1 siblings, 0 replies; 46+ messages in thread
From: Hillf Danton @ 2017-01-18  7:22 UTC (permalink / raw)
  To: 'Vlastimil Babka', 'Mel Gorman',
	'Ganapatrao Kulkarni'
  Cc: 'Michal Hocko', linux-kernel, linux-mm

On Wednesday, January 18, 2017 6:16 AM Vlastimil Babka wrote: 
> 
> This is a preparation for the following patch to make review simpler. While
> the primary motivation is a bug fix, this could also save some cycles in the
> fast path.
> 
This also gets kswapd involved.
Dunno how frequently cpusets are changed in real life.

Hillf

* Re: [RFC 0/4] fix premature OOM due to cpuset races
  2017-01-17 22:16 ` Vlastimil Babka
@ 2017-01-18  9:19   ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-01-18  9:19 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On Tue 17-01-17 23:16:06, Vlastimil Babka wrote:
> This is my attempt to fix the recent report based on LTP cpuset stress test [1].
> Patches are based on 4.9 as that was the initial reported version, but later
> it was reported that this problem exists since 4.7. We will probably want to
> go to stable with this, as triggering OOMs is not nice. That's why the patches
> try to be not too intrusive.
> 
> Longer-term we might try to think how to fix the cpuset mess in a better and
> less error prone way. I was for example very surprised to learn, that cpuset
> updates change not only task->mems_allowed, but also nodemask of mempolicies.
> Until now I expected the parameter to alloc_pages_nodemask() to be stable.
> I wonder why do we then treat cpusets specially in get_page_from_freelist()
> and distinguish HARDWALL etc, when there's unconditional intersection between
> mempolicy and cpuset. I would expect the nodemask adjustment for saving
> overhead in g_p_f(), but that clearly doesn't happen in the current form.
> So we have both crazy complexity and overhead, AFAICS.

Absolutely agreed! This is a mess which should be fixed, and the nodemask
should be stable for each allocation attempt. Trying to catch up with
concurrent changes is just insane and makes the code more complicated.

> [1] https://lkml.kernel.org/r/CAFpQJXUq-JuEP=QPidy4p_=FN0rkH5Z-kfB4qBvsf6jMS87Edg@mail.gmail.com
> 
> Vlastimil Babka (4):
>   mm, page_alloc: fix check for NULL preferred_zone
>   mm, page_alloc: fix fast-path race with cpuset update or removal
>   mm, page_alloc: move cpuset seqcount checking to slowpath
>   mm, page_alloc: fix premature OOM when racing with cpuset mems update
> 
>  mm/page_alloc.c | 58 ++++++++++++++++++++++++++++++++++++---------------------
>  1 file changed, 37 insertions(+), 21 deletions(-)
> 
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

* Re: [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath
  2017-01-18  7:22     ` Hillf Danton
@ 2017-01-18  9:26       ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-18  9:26 UTC (permalink / raw)
  To: Hillf Danton, 'Mel Gorman', 'Ganapatrao Kulkarni'
  Cc: 'Michal Hocko', linux-kernel, linux-mm

On 01/18/2017 08:22 AM, Hillf Danton wrote:
> On Wednesday, January 18, 2017 6:16 AM Vlastimil Babka wrote:
>>
>> This is a preparation for the following patch to make review simpler. While
>> the primary motivation is a bug fix, this could also save some cycles in the
>> fast path.
>>
> This also gets kswapd involved.
> Dunno how frequent cpuset is changed in real life.

I don't think the extra kswapd wakeups due to retry_cpuset would be
noticeable. Such frequent cpuset changes would likely have an associated
overhead larger than that of the wakeups.

>
> Hillf
>

* Re: [RFC 1/4] mm, page_alloc: fix check for NULL preferred_zone
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18  9:31     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-01-18  9:31 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On Tue 17-01-17 23:16:07, Vlastimil Babka wrote:
> Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in
> a zonelist twice") we have a wrong check for NULL preferred_zone, which can
> theoretically happen due to concurrent cpuset modification. We check the
> zoneref pointer which is never NULL and we should check the zone pointer.
> 
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 34ada718ef47..593a11d8bc6b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3763,7 +3763,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
>  					ac.high_zoneidx, ac.nodemask);
> -	if (!ac.preferred_zoneref) {
> +	if (!ac.preferred_zoneref->zone) {

When can the ->zone be NULL?

>  		page = NULL;
>  		goto no_zone;
>  	}
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

* Re: [RFC 4/4] mm, page_alloc: fix premature OOM when racing with cpuset mems update
  2017-01-18  7:12     ` Hillf Danton
@ 2017-01-18  9:32       ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-18  9:32 UTC (permalink / raw)
  To: Hillf Danton, 'Mel Gorman', 'Ganapatrao Kulkarni'
  Cc: 'Michal Hocko', linux-kernel, linux-mm

On 01/18/2017 08:12 AM, Hillf Danton wrote:
>
> On Wednesday, January 18, 2017 6:16 AM Vlastimil Babka wrote:
>>
>> @@ -3802,13 +3811,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  	 * Also recalculate the starting point for the zonelist iterator or
>>  	 * we could end up iterating over non-eligible zones endlessly.
>>  	 */
> Is the newly added comment still needed?

You're right that it's no longer true. I think we can even remove most of the
zoneref trickery and non-NULL checks in the fastpath (as a cleanup patch on
top), as the loop in get_page_from_freelist() should handle it just fine. IIRC
Mel even did this in the microopt series, but I pointed out that a NULL
preferred_zoneref pointer would be dangerous in get_page_from_freelist(). We
didn't realize that we were checking the wrong pointer (i.e. patch 1/4 here).

Vlastimil

>
>> -	if (unlikely(ac.nodemask != nodemask)) {
>> -no_zone:
>> +	if (unlikely(ac.nodemask != nodemask))
>>  		ac.nodemask = nodemask;
>> -		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
>> -						ac.high_zoneidx, ac.nodemask);
>> -		/* If we have NULL preferred zone, slowpath wll handle that */
>> -	}
>>
>>  	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
>>
>> --
>> 2.11.0
>

* Re: [RFC 2/4] mm, page_alloc: fix fast-path race with cpuset update or removal
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18  9:34     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-01-18  9:34 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On Tue 17-01-17 23:16:08, Vlastimil Babka wrote:
> Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers
> OOM killer in few seconds, despite lots of free memory. The test attemps to
> repeatedly fault in memory in one process in a cpuset, while changing allowed
> nodes of the cpuset between 0 and 1 in another process.
> 
> One possible cause is that in the fast path we find the preferred zoneref
> according to current mems_allowed, so that it points to the middle of the
> zonelist, skipping e.g. zones of node 1 completely. If the mems_allowed is
> updated to contain only node 1, we never reach it in the zonelist, and trigger
> OOM before checking the cpuset_mems_cookie.
> 
> This patch fixes the particular case by redoing the preferred zoneref search
> if we switch back to the original nodemask. The condition is also slightly
> changed so that when the last non-root cpuset is removed, we don't miss it.

OK, the patch makes sense, but long-term we really have to get rid of this
insane mask-switching dance.

> Note that this is not a full fix, and more patches will follow.
> 
> Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
> Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 593a11d8bc6b..dedadb4a779f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3783,9 +3783,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	/*
>  	 * Restore the original nodemask if it was potentially replaced with
>  	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
> +	 * Also recalculate the starting point for the zonelist iterator or
> +	 * we could end up iterating over non-eligible zones endlessly.
>  	 */
> -	if (cpusets_enabled())
> +	if (unlikely(ac.nodemask != nodemask)) {
>  		ac.nodemask = nodemask;
> +		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
> +						ac.high_zoneidx, ac.nodemask);
> +		if (!ac.preferred_zoneref->zone)
> +			goto no_zone;
> +	}
> +
>  	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
>  
>  no_zone:
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

* Re: [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18  9:40     ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-01-18  9:40 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On Tue 17-01-17 23:16:09, Vlastimil Babka wrote:
> This is a preparation for the following patch to make review simpler. While
> the primary motivation is a bug fix, this could also save some cycles in the
> fast path.

I cannot say I would be happy about this patch :/ The code is still very
confusing and subtle. I really think we should instead get rid of the
synchronization with concurrent cpuset/mempolicy updates. Have you considered
that?

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 46 +++++++++++++++++++++++++---------------------
>  1 file changed, 25 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dedadb4a779f..bbc3f015f796 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3502,12 +3502,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct page *page = NULL;
>  	unsigned int alloc_flags;
>  	unsigned long did_some_progress;
> -	enum compact_priority compact_priority = DEF_COMPACT_PRIORITY;
> +	enum compact_priority compact_priority;
>  	enum compact_result compact_result;
> -	int compaction_retries = 0;
> -	int no_progress_loops = 0;
> +	int compaction_retries;
> +	int no_progress_loops;
>  	unsigned long alloc_start = jiffies;
>  	unsigned int stall_timeout = 10 * HZ;
> +	unsigned int cpuset_mems_cookie;
>  
>  	/*
>  	 * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3528,6 +3529,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		gfp_mask &= ~__GFP_ATOMIC;
>  
> +retry_cpuset:
> +	compaction_retries = 0;
> +	no_progress_loops = 0;
> +	compact_priority = DEF_COMPACT_PRIORITY;
> +	cpuset_mems_cookie = read_mems_allowed_begin();
> +
>  	/*
>  	 * The fast path uses conservative alloc_flags to succeed only until
>  	 * kswapd needs to be woken up, and to avoid the cost of setting up
> @@ -3699,6 +3706,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  nopage:
> +	/*
> +	 * When updating a task's mems_allowed, it is possible to race with
> +	 * parallel threads in such a way that an allocation can fail while
> +	 * the mask is being updated. If a page allocation is about to fail,
> +	 * check if the cpuset changed during allocation and if so, retry.
> +	 */
> +	if (read_mems_allowed_retry(cpuset_mems_cookie))
> +		goto retry_cpuset;
> +
>  	warn_alloc(gfp_mask,
>  			"page allocation failure: order:%u", order);
>  got_pg:
> @@ -3713,7 +3729,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  			struct zonelist *zonelist, nodemask_t *nodemask)
>  {
>  	struct page *page;
> -	unsigned int cpuset_mems_cookie;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
>  	gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
>  	struct alloc_context ac = {
> @@ -3750,9 +3765,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	if (IS_ENABLED(CONFIG_CMA) && ac.migratetype == MIGRATE_MOVABLE)
>  		alloc_flags |= ALLOC_CMA;
>  
> -retry_cpuset:
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> -
>  	/* Dirty zone balancing only done in the fast path */
>  	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
>  
> @@ -3765,6 +3777,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  					ac.high_zoneidx, ac.nodemask);
>  	if (!ac.preferred_zoneref->zone) {
>  		page = NULL;
> +		/*
> +		 * This might be due to race with cpuset_current_mems_allowed
> +		 * update, so make sure we retry with original nodemask.
> +		 */
>  		goto no_zone;
>  	}
>  
> @@ -3787,27 +3803,15 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	 * we could end up iterating over non-eligible zones endlessly.
>  	 */
>  	if (unlikely(ac.nodemask != nodemask)) {
> +no_zone:
>  		ac.nodemask = nodemask;
>  		ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
>  						ac.high_zoneidx, ac.nodemask);
> -		if (!ac.preferred_zoneref->zone)
> -			goto no_zone;
> +		/* If we have NULL preferred zone, slowpath wll handle that */
>  	}
>  
>  	page = __alloc_pages_slowpath(alloc_mask, order, &ac);
>  
> -no_zone:
> -	/*
> -	 * When updating a task's mems_allowed, it is possible to race with
> -	 * parallel threads in such a way that an allocation can fail while
> -	 * the mask is being updated. If a page allocation is about to fail,
> -	 * check if the cpuset changed during allocation and if so, retry.
> -	 */
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie))) {
> -		alloc_mask = gfp_mask;
> -		goto retry_cpuset;
> -	}
> -
>  out:
>  	if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
>  	    unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 1/4] mm, page_alloc: fix check for NULL preferred_zone
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18  9:45     ` Mel Gorman
  -1 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2017-01-18  9:45 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Ganapatrao Kulkarni, Michal Hocko, linux-kernel, linux-mm

On Tue, Jan 17, 2017 at 11:16:07PM +0100, Vlastimil Babka wrote:
> Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in
> a zonelist twice") we have a wrong check for NULL preferred_zone, which can
> theoretically happen due to concurrent cpuset modification. We check the
> zoneref pointer which is never NULL and we should check the zone pointer.
> 
> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 1/4] mm, page_alloc: fix check for NULL preferred_zone
  2017-01-18  9:31     ` Michal Hocko
@ 2017-01-18  9:45       ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-18  9:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On 01/18/2017 10:31 AM, Michal Hocko wrote:
> On Tue 17-01-17 23:16:07, Vlastimil Babka wrote:
>> Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in
>> a zonelist twice") we have a wrong check for NULL preferred_zone, which can
>> theoretically happen due to concurrent cpuset modification. We check the
>> zoneref pointer which is never NULL and we should check the zone pointer.
>>
>> Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/page_alloc.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 34ada718ef47..593a11d8bc6b 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3763,7 +3763,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>>  	 */
>>  	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
>>  					ac.high_zoneidx, ac.nodemask);
>> -	if (!ac.preferred_zoneref) {
>> +	if (!ac.preferred_zoneref->zone) {
>
> When can the ->zone be NULL?

Either we get a genuinely screwed nodemask, or there's a concurrent cpuset 
update and the nodes in the zonelist are ordered in such a way that, as we 
iterate the zonelist, we see all of them as unavailable to us in the 
nodemask/current->mems_allowed, so we reach the end of the zonelist. The 
zonelists are terminated with a zoneref whose zone pointer is NULL.
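
To illustrate that terminator behaviour, here is a standalone sketch that 
models the zoneref array in plain userspace C (the structures are simplified 
stand-ins, not the kernel's definitions, and first_allowed() only mimics the 
shape of first_zones_zonelist()):

#include <stdio.h>
#include <stddef.h>

struct zone { int nid; };
struct zoneref { struct zone *zone; int zone_idx; };

/* Walk a NULL-zone-terminated array, skipping zones whose node is not allowed. */
static struct zoneref *first_allowed(struct zoneref *z, int allowed_nid)
{
	while (z->zone && z->zone->nid != allowed_nid)
		z++;
	return z;	/* may point at the terminator, but is never NULL itself */
}

int main(void)
{
	struct zone node0 = { .nid = 0 };
	struct zoneref zonelist[] = {
		{ .zone = &node0, .zone_idx = 0 },
		{ .zone = NULL,   .zone_idx = 0 },	/* terminator */
	};
	/* pretend mems_allowed changed under us and only node 1 is allowed now */
	struct zoneref *z = first_allowed(zonelist, 1);

	/* prints "zoneref NULL: 0  zone NULL: 1" - hence check ->zone, not the pointer */
	printf("zoneref NULL: %d  zone NULL: %d\n", z == NULL, z->zone == NULL);
	return 0;
}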

>
>>  		page = NULL;
>>  		goto no_zone;
>>  	}
>> --
>> 2.11.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 2/4] mm, page_alloc: fix fast-path race with cpuset update or removal
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18  9:46     ` Mel Gorman
  -1 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2017-01-18  9:46 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Ganapatrao Kulkarni, Michal Hocko, linux-kernel, linux-mm

On Tue, Jan 17, 2017 at 11:16:08PM +0100, Vlastimil Babka wrote:
> Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers
> the OOM killer in a few seconds, despite lots of free memory. The test attempts to
> repeatedly fault in memory in one process in a cpuset, while changing allowed
> nodes of the cpuset between 0 and 1 in another process.
> 
> One possible cause is that in the fast path we find the preferred zoneref
> according to current mems_allowed, so that it points to the middle of the
> zonelist, skipping e.g. zones of node 1 completely. If the mems_allowed is
> updated to contain only node 1, we never reach it in the zonelist, and trigger
> OOM before checking the cpuset_mems_cookie.
> 
> This patch fixes the particular case by redoing the preferred zoneref search
> if we switch back to the original nodemask. The condition is also slightly
> changed so that when the last non-root cpuset is removed, we don't miss it.
> 
> Note that this is not a full fix, and more patches will follow.
> 
> Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
> Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath
  2017-01-18  9:40     ` Michal Hocko
@ 2017-01-18  9:48       ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-18  9:48 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On 01/18/2017 10:40 AM, Michal Hocko wrote:
> On Tue 17-01-17 23:16:09, Vlastimil Babka wrote:
>> This is a preparation for the following patch to make review simpler. While
>> the primary motivation is a bug fix, this could also save some cycles in the
>> fast path.
>
> I cannot say I would be happy about this patch :/ The code is still very
> confusing and subtle. I really think we should get rid of
> synchronization with the concurrent cpuset/mempolicy updates instead.
> Have you considered that instead?

Not so thoroughly yet, but I already suspect it would be intrusive for stable. 
We could make copies of the nodemask and mems_allowed and protect just the 
copying with the seqcount, but that would mean extra overhead and stack space. 
Also we might try reverting 682a3385e773 ("mm, page_alloc: inline the fast path 
of the zonelist iterator") ...

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 1/4] mm, page_alloc: fix check for NULL preferred_zone
  2017-01-18  9:45       ` Vlastimil Babka
@ 2017-01-18  9:53         ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-01-18  9:53 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On Wed 18-01-17 10:45:33, Vlastimil Babka wrote:
> On 01/18/2017 10:31 AM, Michal Hocko wrote:
> > On Tue 17-01-17 23:16:07, Vlastimil Babka wrote:
> > > Since commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in
> > > a zonelist twice") we have a wrong check for NULL preferred_zone, which can
> > > theoretically happen due to concurrent cpuset modification. We check the
> > > zoneref pointer which is never NULL and we should check the zone pointer.
> > > 
> > > Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
> > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> > > ---
> > >  mm/page_alloc.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 34ada718ef47..593a11d8bc6b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -3763,7 +3763,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> > >  	 */
> > >  	ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
> > >  					ac.high_zoneidx, ac.nodemask);
> > > -	if (!ac.preferred_zoneref) {
> > > +	if (!ac.preferred_zoneref->zone) {
> > 
> > When can the ->zone be NULL?
> 
> Either we get a genuinely screwed nodemask, or there's a concurrent cpuset
> update and nodes in zonelist are ordered in such a way that we see all of
> them as not being available to us in the nodemask/current->mems_alowed, when
> we iterate the zonelist, so we reach the end of zonelist. The zonelists are
> terminated with a zoneref with NULL zone pointer.

Thanks for the clarification.  Please add a big fat comment in
first_zones_zonelist about this potential case.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath
  2017-01-18  9:48       ` Vlastimil Babka
@ 2017-01-18  9:55         ` Michal Hocko
  -1 siblings, 0 replies; 46+ messages in thread
From: Michal Hocko @ 2017-01-18  9:55 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Mel Gorman, Ganapatrao Kulkarni, linux-kernel, linux-mm

On Wed 18-01-17 10:48:55, Vlastimil Babka wrote:
> On 01/18/2017 10:40 AM, Michal Hocko wrote:
> > On Tue 17-01-17 23:16:09, Vlastimil Babka wrote:
> > > This is a preparation for the following patch to make review simpler. While
> > > the primary motivation is a bug fix, this could also save some cycles in the
> > > fast path.
> > 
> > I cannot say I would be happy about this patch :/ The code is still very
> > confusing and subtle. I really think we should get rid of
> > synchronization with the concurrent cpuset/mempolicy updates instead.
> > Have you considered that instead?
> 
> Not so thoroughly yet, but I already suspect it would be intrusive for
> stable. We could make copies of nodemask and mems_allowed and protect just
> the copying with seqcount, but that would mean overhead and stack space.
> Also we might try revert 682a3385e773 ("mm, page_alloc: inline the fast path
> of the zonelist iterator") ...

If reverting that patch makes the problem go away and it is applicable
for the stable I would rather go that way for stable and take a deep
breath and rethink the whole cpuset and nodemask manipulation in the
allocation path for a better long term solution.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18 10:03     ` Mel Gorman
  -1 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2017-01-18 10:03 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Ganapatrao Kulkarni, Michal Hocko, linux-kernel, linux-mm

On Tue, Jan 17, 2017 at 11:16:09PM +0100, Vlastimil Babka wrote:
> This is a preparation for the following patch to make review simpler. While
> the primary motivation is a bug fix, this could also save some cycles in the
> fast path.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

To be clear, the fast-path savings only apply when cpusets are active, though
that is still a good thing.  Most of the time, cpusets are disabled via static
branches. I see there were concerns raised that this would retry the kswapd
paths but I don't really see the issue. The same wakeup could occur due to a
cpuset switch with the existing retry. Even a potentially spurious wakeup of
kswapd is ok if the slow paths were being hit anyway, as kswapd is probably
still awake from the first wakeup. If anything, the fact that kswapd wakeups
ignore cpusets and potentially wake kswapd on forbidden nodes is more
problematic, but not worth fixing. If kswapd needs to wake on a node outside
the cpuset then it's going to be woken by some active process outside the
cpuset some time in the future, so:

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC 4/4] mm, page_alloc: fix premature OOM when racing with cpuset mems update
  2017-01-17 22:16   ` Vlastimil Babka
@ 2017-01-18 10:08     ` Mel Gorman
  -1 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2017-01-18 10:08 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Ganapatrao Kulkarni, Michal Hocko, linux-kernel, linux-mm

On Tue, Jan 17, 2017 at 11:16:10PM +0100, Vlastimil Babka wrote:
> Ganapatrao Kulkarni reported that the LTP test cpuset01 in stress mode triggers
> the OOM killer in a few seconds, despite lots of free memory. The test attempts to
> repeatedly fault in memory in one process in a cpuset, while changing allowed
> nodes of the cpuset between 0 and 1 in another process.
> 
> The problem comes from insufficient protection against cpuset changes, which
> can cause get_page_from_freelist() to consider all zones as non-eligible due to
> nodemask and/or current->mems_allowed. This was masked in the past by
> sufficient retries, but since commit 682a3385e773 ("mm, page_alloc: inline the
> fast path of the zonelist iterator") we fix the preferred_zoneref once, and
> don't iterate the whole zonelist in further attempts.
> 
> A previous patch fixed this problem for current->mems_allowed. However, cpuset
> changes also update the policy nodemasks. The fix has two parts. We have to
> repeat the preferred_zoneref search when we detect cpuset update by way of
> seqcount, and we have to check the seqcount before considering OOM.
> 
> Reported-by: Ganapatrao Kulkarni <gpkulkarni@gmail.com>
> Fixes: 682a3385e773 ("mm, page_alloc: inline the fast path of the zonelist iterator")
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [RFC 5/4] mm, page_alloc: fix premature OOM due to vma mempolicy update
  2017-01-17 22:16 ` Vlastimil Babka
@ 2017-01-18 16:20   ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-18 16:20 UTC (permalink / raw)
  To: Mel Gorman, Ganapatrao Kulkarni; +Cc: Michal Hocko, linux-kernel, linux-mm

Due to interactions between mempolicies and cpusets, it can happen that the
intersection of nodes allowed by both is empty. In that case, cpuset has a
higher priority, which can be seen in the code of policy_nodemask().

There's however a possible race when cpuset's mems_allowed is updated after
policy_nodemask() called from alloc_pages_vma() observes a non-empty
intersection, and then it becomes empty. This can be either a temporary state
until the vma's mempolicy gets updated as part of the cpuset change, or
permanent. In both cases, this leads to OOMs when all calls to
get_page_from_freelist() end up iterating over the empty intersection.

One part of the issue is that unlike task mempolicy, vma mempolicy rebinding
by cpuset isn't protected by the seqlock, so the allocation cannot detect the
race and retry. This patch adds the necessary protections.

The second part is that although alloc_pages_vma() performs the read-side
operations on the seqlock, the cookie is different from the one used by
__alloc_pages_slowpath(). This leaves a window where the cpuset update will
finish before we read the cookie in __alloc_pages_slowpath(), and thus we
won't detect it, and the OOM will happen before we can return to
alloc_pages_vma() and check its own cookie.

We could pass the first cookie down, but that would make things more
complicated, so this patch instead rechecks for the empty intersection in
__alloc_pages_slowpath() before OOM.

Not-yet-signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Unfortunately, while working on the series I came to suspect that once vma
mempolicies are added to the mix via mbind(), the previous patches won't help.
By changing the LTP cpuset01 testcase (will post patch as a reply) this was
confirmed and the problem is also older than the changes in 4.7.
This is one approach that seems to fix this in my tests, but it's not really
nice and just tries to plug more holes in the current design. I'm posting it
mainly for discussion.
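
For reference, a heavily abbreviated sketch of the two read sides the above
description refers to (based on the 4.9-era helpers; the control flow is
simplified and the labels are illustrative, not the real call chains):

	/* alloc_pages_vma(), abbreviated */
	cookie1 = read_mems_allowed_begin();
	nmask = policy_nodemask(gfp, pol);	/* may still see a non-empty intersection */
	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
	if (!page && read_mems_allowed_retry(cookie1))
		goto retry;			/* too late if the OOM killer already fired */

	/* __alloc_pages_slowpath(), abbreviated */
	cookie2 = read_mems_allowed_begin();	/* taken later than cookie1 */
	...
	if (read_mems_allowed_retry(cookie2))	/* misses an update that completed */
		goto retry_cpuset;		/* between cookie1 and cookie2 */

A cpuset update that completes in the window between the two
read_mems_allowed_begin() calls is invisible to the slowpath's check, so the
OOM can be declared before control returns to alloc_pages_vma() and its own
cookie is rechecked; hence the extra cpuset_nodemask_valid_mems_allowed()
check before OOM in the patch below.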

 include/linux/mempolicy.h |  6 ++++--
 kernel/cpuset.c           |  4 ++--
 mm/mempolicy.c            | 16 +++++++++++++---
 mm/page_alloc.c           | 13 +++++++++++++
 4 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5e5b2969d931..c5519027eb6a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -143,7 +143,8 @@ extern void numa_default_policy(void);
 extern void numa_policy_init(void);
 extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
 				enum mpol_rebind_step step);
-extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
+extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new,
+						struct task_struct *tsk);
 
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
@@ -257,7 +258,8 @@ static inline void mpol_rebind_task(struct task_struct *tsk,
 {
 }
 
-static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
+static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new,
+						struct task_struct *tsk)
 {
 }
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 29f815d2ef7e..727ddf5d8222 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1117,7 +1117,7 @@ static void update_tasks_nodemask(struct cpuset *cs)
 
 		migrate = is_memory_migrate(cs);
 
-		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mpol_rebind_mm(mm, &cs->mems_allowed, task);
 		if (migrate)
 			cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
 		else
@@ -1559,7 +1559,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		struct mm_struct *mm = get_task_mm(leader);
 
 		if (mm) {
-			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to);
+			mpol_rebind_mm(mm, &cpuset_attach_nodemask_to, leader);
 
 			/*
 			 * old_mems_allowed is the same with mems_allowed
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b859af06b87..bc6983732333 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -437,13 +437,23 @@ void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
  * Call holding a reference to mm.  Takes mm->mmap_sem during call.
  */
 
-void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
+void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new,
+					struct task_struct *tsk)
 {
 	struct vm_area_struct *vma;
 
 	down_write(&mm->mmap_sem);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
-		mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
+	task_lock(tsk);
+	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		if (vma->vm_policy) {
+			local_irq_disable();
+			write_seqcount_begin(&tsk->mems_allowed_seq);
+			mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
+			write_seqcount_end(&tsk->mems_allowed_seq);
+			local_irq_enable();
+		}
+	}
+	task_unlock(tsk);
 	up_write(&mm->mmap_sem);
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89c8cf87eab5..36fe6742c276 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3699,6 +3699,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				&compaction_retries))
 		goto retry;
 
+	/*
+	 * It's possible that alloc_pages_vma() saw that nodemask is compatible
+	 * with cpuset's mems_allowed, but then the cpuset was updated and this
+	 * is no longer true.
+	 * It's also possible that the discrepancy was only visible during our
+	 * allocation attempts, and now nodemask has been updated to match the
+	 * cpuset and this check will pass. In that case the
+	 * cpuset_mems_cookie check below should catch that.
+	 */
+	if (ac->nodemask && (alloc_flags & ALLOC_CPUSET)
+			&& !cpuset_nodemask_valid_mems_allowed(ac->nodemask))
+		goto nopage;
+
 	/*
 	 * It's possible we raced with cpuset update so the OOM would be
 	 * premature (see below the nopage: label for full explanation).
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC 5/4] mm, page_alloc: fix premature OOM due to vma mempolicy update
  2017-01-18 16:20   ` Vlastimil Babka
@ 2017-01-18 16:23     ` Vlastimil Babka
  -1 siblings, 0 replies; 46+ messages in thread
From: Vlastimil Babka @ 2017-01-18 16:23 UTC (permalink / raw)
  To: Mel Gorman, Ganapatrao Kulkarni; +Cc: Michal Hocko, linux-kernel, linux-mm

On 01/18/2017 05:20 PM, Vlastimil Babka wrote:
> By changing the LTP cpuset01 testcase (will post patch as a reply) this was
> confirmed and the problem is also older than the changes in 4.7.

-----8<-----
diff --git a/testcases/kernel/mem/cpuset/cpuset01.c b/testcases/kernel/mem/cpuset/cpuset01.c
index 558420f72f9b..ef673366e15e 100644
--- a/testcases/kernel/mem/cpuset/cpuset01.c
+++ b/testcases/kernel/mem/cpuset/cpuset01.c
@@ -40,6 +40,7 @@
 #include <fcntl.h>
 #include <math.h>
 #if HAVE_NUMAIF_H
+#include <linux/mempolicy.h>
 #include <numaif.h>
 #endif
 #include <signal.h>
@@ -61,6 +62,7 @@ volatile int end;
 static int *nodes;
 static int nnodes;
 static long ncpus;
+static unsigned long nmask[MAXNODES / BITS_PER_LONG] = { 0 };
 
 static void testcpuset(void);
 static void sighandler(int signo LTP_ATTRIBUTE_UNUSED);
@@ -89,7 +91,6 @@ static void testcpuset(void)
 {
 	int lc;
 	int child, i, status;
-	unsigned long nmask[MAXNODES / BITS_PER_LONG] = { 0 };
 	char mems[BUFSIZ], buf[BUFSIZ];
 
 	read_cpuset_files(CPATH, "cpus", buf);
@@ -105,6 +106,7 @@ static void testcpuset(void)
 		for (i = 0; i < nnodes; i++) {
 			if (nodes[i] >= MAXNODES)
 				continue;
+			printf("bind to node %d\n", nodes[i]);
 			set_node(nmask, nodes[i]);
 		}
 		if (set_mempolicy(MPOL_BIND, nmask, MAXNODES) == -1)
@@ -163,6 +165,8 @@ static int mem_hog(void)
 			tst_resm(TFAIL | TERRNO, "mmap");
 			break;
 		}
+		if (mbind(addr, pagesize * 10, MPOL_BIND, nmask, MAXNODES, 0) == -1)
+			tst_brkm(TBROK | TERRNO, cleanup, "mbind");
 		memset(addr, 0xF7, pagesize * 10);
 		munmap(addr, pagesize * 10);
 	}

^ permalink raw reply related	[flat|nested] 46+ messages in thread

end of thread

Thread overview: 46+ messages
2017-01-17 22:16 [RFC 0/4] fix premature OOM due to cpuset races Vlastimil Babka
2017-01-17 22:16 ` Vlastimil Babka
2017-01-17 22:16 ` [RFC 1/4] mm, page_alloc: fix check for NULL preferred_zone Vlastimil Babka
2017-01-17 22:16   ` Vlastimil Babka
2017-01-18  9:31   ` Michal Hocko
2017-01-18  9:31     ` Michal Hocko
2017-01-18  9:45     ` Vlastimil Babka
2017-01-18  9:45       ` Vlastimil Babka
2017-01-18  9:53       ` Michal Hocko
2017-01-18  9:53         ` Michal Hocko
2017-01-18  9:45   ` Mel Gorman
2017-01-18  9:45     ` Mel Gorman
2017-01-17 22:16 ` [RFC 2/4] mm, page_alloc: fix fast-path race with cpuset update or removal Vlastimil Babka
2017-01-17 22:16   ` Vlastimil Babka
2017-01-18  9:34   ` Michal Hocko
2017-01-18  9:34     ` Michal Hocko
2017-01-18  9:46   ` Mel Gorman
2017-01-18  9:46     ` Mel Gorman
2017-01-17 22:16 ` [RFC 3/4] mm, page_alloc: move cpuset seqcount checking to slowpath Vlastimil Babka
2017-01-17 22:16   ` Vlastimil Babka
2017-01-18  7:22   ` Hillf Danton
2017-01-18  7:22     ` Hillf Danton
2017-01-18  9:26     ` Vlastimil Babka
2017-01-18  9:26       ` Vlastimil Babka
2017-01-18  9:40   ` Michal Hocko
2017-01-18  9:40     ` Michal Hocko
2017-01-18  9:48     ` Vlastimil Babka
2017-01-18  9:48       ` Vlastimil Babka
2017-01-18  9:55       ` Michal Hocko
2017-01-18  9:55         ` Michal Hocko
2017-01-18 10:03   ` Mel Gorman
2017-01-18 10:03     ` Mel Gorman
2017-01-17 22:16 ` [RFC 4/4] mm, page_alloc: fix premature OOM when racing with cpuset mems update Vlastimil Babka
2017-01-17 22:16   ` Vlastimil Babka
2017-01-18  7:12   ` Hillf Danton
2017-01-18  7:12     ` Hillf Danton
2017-01-18  9:32     ` Vlastimil Babka
2017-01-18  9:32       ` Vlastimil Babka
2017-01-18 10:08   ` Mel Gorman
2017-01-18 10:08     ` Mel Gorman
2017-01-18  9:19 ` [RFC 0/4] fix premature OOM due to cpuset races Michal Hocko
2017-01-18  9:19   ` Michal Hocko
2017-01-18 16:20 ` [RFC 5/4] mm, page_alloc: fix premature OOM due to vma mempolicy update Vlastimil Babka
2017-01-18 16:20   ` Vlastimil Babka
2017-01-18 16:23   ` Vlastimil Babka
2017-01-18 16:23     ` Vlastimil Babka
