* [RFC 0/6] cpuset/mempolicies related fixes and cleanups
From: Vlastimil Babka @ 2017-04-11 14:06 UTC
  To: linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Vlastimil Babka

I've recently summarized the cpuset/mempolicy issues in an LSF/MM proposal [1],
and the discussion itself is covered in [2]. I've been trying to rewrite the
handling as proposed, with the idea that changing the semantics to make all
mempolicies static wrt cpuset updates (and discarding the relative and default
modes) can be tried on top, as there's a high risk of that change being
rejected/reverted because somebody might still care about the removed modes.

However, I haven't yet figured out how to properly:

1) make mempolicies swappable instead of rebinding them in place. I thought
mbind() already works that way and uses refcounting to avoid use-after-free of
the old policy by a parallel allocation, but it turns out true refcounting is
only done for shared (shmem) mempolicies, and the actual protection for
mbind() comes from mmap_sem. Extending the refcounting means more overhead in
the allocator hot path. Also, swapping whole mempolicies means that we have to
allocate the new ones, which can fail, and reverting the partially done work
also means allocating (note that mbind() doesn't care and will just leave part
of the range updated and part not updated when returning -ENOMEM...). A
minimal sketch of the swap approach follows point 2 below.

2) make cpuset's task->mems_allowed also swappable (after converting it from
nodemask to zonelist, which is the easy part) for mostly the same reasons.
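
To make point 1) concrete, here is a minimal sketch of what swapping a task's
mempolicy could look like. This is hypothetical code, not part of the
patchset: mpol_rebind_by_swap() does not exist, and the sketch deliberately
runs into exactly the two problems described above (it assumes true
refcounting of all mempolicies, and its allocation can fail with no good way
to revert):

	/* Hypothetical sketch only: rebind by replacing the mempolicy
	 * object instead of modifying it in place. */
	static int mpol_rebind_by_swap(struct task_struct *tsk,
				       const nodemask_t *newmems)
	{
		struct mempolicy *new, *old = tsk->mempolicy;

		if (!old)
			return 0;

		new = mpol_dup(old);
		if (IS_ERR(new))
			return PTR_ERR(new);	/* allocation can fail, and
						 * reverting earlier swaps
						 * would allocate as well */

		mpol_ops[new->mode].rebind(new, newmems);
		tsk->mempolicy = new;	/* publish the rebound copy */
		mpol_put(old);		/* safe only with true refcounting
					 * against parallel allocations */
		return 0;
	}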

The good news is that while trying to do the above, I've at least figured out
how to hopefully close the remaining premature OOMs, and to do a bunch of
cleanups on top, removing a fair amount of the code that was also supposed to
prevent the cpuset update races, but no longer works. This should fix the most
pressing concerns with this topic and give us a better baseline before either
proceeding with the original proposal, or pushing a change of semantics that
removes problem 1) above. I'd then be fine with trying to change the semantics
first and rewrite later.

The patchset is based on next-20170411 and has been tested with the LTP
cpuset01 stress test.

[1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
[2] https://lwn.net/Articles/717797/

Vlastimil Babka (6):
  mm, page_alloc: fix more premature OOM due to race with cpuset update
  mm, mempolicy: stop adjusting current->il_next in
    mpol_rebind_nodemask()
  mm, page_alloc: pass preferred nid instead of zonelist to allocator
  mm, mempolicy: simplify rebinding mempolicies when updating cpusets
  mm, cpuset: always use seqlock when changing task's nodemask
  mm, mempolicy: don't check cpuset seqlock where it doesn't matter

 include/linux/gfp.h            |  11 ++-
 include/linux/mempolicy.h      |  12 ++-
 include/uapi/linux/mempolicy.h |   8 --
 kernel/cgroup/cpuset.c         |  33 ++-------
 mm/hugetlb.c                   |  15 ++--
 mm/memory_hotplug.c            |   6 +-
 mm/mempolicy.c                 | 165 +++++++++--------------------------------
 mm/page_alloc.c                |  61 ++++++++++-----
 8 files changed, 109 insertions(+), 202 deletions(-)

-- 
2.12.2

* [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
From: Vlastimil Babka @ 2017-04-11 14:06 UTC
  To: linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Vlastimil Babka

Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset
mems update") has fixed known recent regressions found by LTP's cpuset01
testcase. However, I have found that by modifying the testcase to use per-vma
mempolicies via mbind(2) instead of per-task mempolicies via set_mempolicy(2),
the premature OOM still happens, and the issue is much older.

The root of the problem is that the cpuset's mems_allowed and the mempolicy's
nodemask can temporarily have no intersection, so get_page_from_freelist()
cannot find any usable zone. The current semantics for an empty intersection is
to ignore the mempolicy's nodemask and honour the cpuset restrictions. This is
checked in policy_nodemask(), but the racy update can happen after we already
passed the check. Such races should be protected by the seqlock
task->mems_allowed_seq, but it doesn't work here, because 1) mpol_rebind_mm()
does not happen under the seqlock for write, and doing so would lead to
deadlock, as it takes mmap_sem for write, while the allocation can have
mmap_sem for read when it's taking the seqlock for read. And 2) the seqlock
cookie of the callers of policy_nodemask() (alloc_pages_vma() and
alloc_pages_current()) is different from the one of __alloc_pages_slowpath(),
so there's still a potential race window.

This patch fixes the issue by having __alloc_pages_slowpath() check for an
empty intersection of the cpuset and ac->nodemask before OOM or allocation
failure. If it's indeed empty, the nodemask is ignored and the allocation
retried, which mimics policy_nodemask(). This works fine, because almost all
callers of __alloc_pages_nodemask obtain the nodemask via policy_nodemask().
The only exception is new_node_page() from hotplug, where the potential
violation of the nodemask isn't an issue, as there's already a fallback
allocation attempt without any nodemask. If there's a future caller that needs
to have its specific nodemask honoured over the task's cpuset restrictions,
we'll have to add e.g. a gfp flag for that.
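
For reference, the seqlock protection referred to above is the cookie pattern
used by the allocating callers. A simplified reader-side sketch, modeled on
__hugetlb_alloc_buddy_huge_page() as it looks after patch 3 of this series
(the nid/nodemask lookup from the mempolicy is elided):

	struct page *page;
	unsigned int cpuset_mems_cookie;

	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		/* compute nid and nodemask from the mempolicy here */
		page = __alloc_pages_nodemask(gfp, order, nid, nodemask);
		if (page)
			return page;
		/* mems_allowed changed during the allocation: retry */
	} while (read_mems_allowed_retry(cpuset_mems_cookie));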

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32b31d661c9c..502d82f0e004 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3668,6 +3668,39 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	return false;
 }
 
+static inline bool
+check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
+{
+	/*
+	 * It's possible that cpuset's mems_allowed and the nodemask from
+	 * mempolicy don't intersect. This should be normally dealt with by
+	 * policy_nodemask(), but it's possible to race with cpuset update in
+	 * such a way the check therein was true, and then it became false
+	 * before we got our cpuset_mems_cookie here.
+	 * This assumes that for all allocations, ac->nodemask can come only
+	 * from MPOL_BIND mempolicy (whose documented semantics is to be ignored
+	 * when it does not intersect with the cpuset restrictions) or the
+	 * caller can deal with a violated nodemask.
+	 */
+	if (cpusets_enabled() && ac->nodemask &&
+			!cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
+		ac->nodemask = NULL;
+		return true;
+	}
+
+	/*
+	 * When updating a task's mems_allowed or mempolicy nodemask, it is
+	 * possible to race with parallel threads in such a way that our
+	 * allocation can fail while the mask is being updated. If we are about
+	 * to fail, check if the cpuset changed during allocation and if so,
+	 * retry.
+	 */
+	if (read_mems_allowed_retry(cpuset_mems_cookie))
+		return true;
+
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -3863,11 +3896,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				&compaction_retries))
 		goto retry;
 
-	/*
-	 * It's possible we raced with cpuset update so the OOM would be
-	 * premature (see below the nopage: label for full explanation).
-	 */
-	if (read_mems_allowed_retry(cpuset_mems_cookie))
+
+	/* Deal with possible cpuset update races before we start OOM killing */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
@@ -3886,14 +3917,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
-	/*
-	 * When updating a task's mems_allowed or mempolicy nodemask, it is
-	 * possible to race with parallel threads in such a way that our
-	 * allocation can fail while the mask is being updated. If we are about
-	 * to fail, check if the cpuset changed during allocation and if so,
-	 * retry.
-	 */
-	if (read_mems_allowed_retry(cpuset_mems_cookie))
+	/* Deal with possible cpuset update races before we fail */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
 	/*
-- 
2.12.2

* [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
From: Vlastimil Babka @ 2017-04-11 14:06 UTC
  To: linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Vlastimil Babka

The task->il_next variable stores the node to use for the next allocation under
the task's MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
bind mempolicies due to changing cpuset mems. Currently it also tries to make
sure that current->il_next is valid within the updated nodemask. This is bogus,
because 1) we are potentially updating any task's mempolicy, not just
current's, and 2) we might be updating a per-vma mempolicy, not a task one.

The interleave_nodes() function that uses il_next can cope fine with the value
not being within the currently allowed nodes (see the sketch below), so this
hasn't manifested as an actual issue. It thus also won't be an issue if we just
remove this adjustment completely.
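
For context, interleave_nodes() currently looks roughly as follows
(paraphrased, not a verbatim quote): next_node_in() wraps around within the
policy's nodemask, so a stale il_next outside the allowed nodes is simply
stepped over on the next allocation:

	/* Do dynamic interleaving for a process */
	static unsigned interleave_nodes(struct mempolicy *policy)
	{
		unsigned nid, next;
		struct task_struct *me = current;

		nid = me->il_next;
		next = next_node_in(nid, policy->v.nodes);
		if (next < MAX_NUMNODES)
			me->il_next = next;
		return nid;
	}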

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mempolicy.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b334bfe9..efeec8d2bce5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -349,12 +349,6 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
 		pol->v.nodes = tmp;
 	else
 		BUG();
-
-	if (!node_isset(current->il_next, tmp)) {
-		current->il_next = next_node_in(current->il_next, tmp);
-		if (current->il_next >= MAX_NUMNODES)
-			current->il_next = numa_node_id();
-	}
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
-- 
2.12.2

* [RFC 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator
From: Vlastimil Babka @ 2017-04-11 14:06 UTC
  To: linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Vlastimil Babka

The main allocator function __alloc_pages_nodemask() takes a zonelist pointer
as one of its parameters. All of its callers directly or indirectly obtain the
zonelist via node_zonelist() using a preferred node id and the gfp_mask. We can
make the code a bit simpler by doing the zonelist lookup in
__alloc_pages_nodemask() itself, passing it a preferred node id instead (the
gfp_mask is already another parameter).

There are some code size benefits thanks to the removal of the inlined
node_zonelist() (see the reference sketch below):

bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

This will also make things simpler if we proceed with converting cpusets to
zonelists.
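
For reference, the inlined lookup that every call site currently performs is
roughly this (from include/linux/gfp.h of this era):

	static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
	{
		return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
	}

Moving this single lookup into __alloc_pages_nodemask() is what shrinks all
the call sites in the bloat-o-meter output above.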

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h       | 11 +++++------
 include/linux/mempolicy.h |  6 +++---
 mm/hugetlb.c              | 15 +++++++++------
 mm/memory_hotplug.c       |  6 ++----
 mm/mempolicy.c            | 41 +++++++++++++++++++----------------------
 mm/page_alloc.c           | 10 +++++-----
 6 files changed, 43 insertions(+), 46 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 2b1a44f5bdb6..666af3c39d00 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -432,14 +432,13 @@ static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
 
 struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		       struct zonelist *zonelist, nodemask_t *nodemask);
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
+							nodemask_t *nodemask);
 
 static inline struct page *
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
 {
-	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
 /*
@@ -452,7 +451,7 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
 	VM_WARN_ON(!node_online(nid));
 
-	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+	return __alloc_pages(gfp_mask, order, nid);
 }
 
 /*
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f4d8281832b..ecb6cbeede5a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -146,7 +146,7 @@ extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
 				enum mpol_rebind_step step);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 
-extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+extern int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
@@ -269,13 +269,13 @@ static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 {
 }
 
-static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+static inline int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask)
 {
 	*mpol = NULL;
 	*nodemask = NULL;
-	return node_zonelist(0, gfp_flags);
+	return 0;
 }
 
 static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e5828875f7bb..9f1f399bb913 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -904,6 +904,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
+	gfp_t gfp_mask;
+	int nid;
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
@@ -924,12 +926,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	zonelist = huge_zonelist(vma, address,
-					htlb_alloc_mask(h), &mpol, &nodemask);
+	gfp_mask = htlb_alloc_mask(h);
+	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+	zonelist = node_zonelist(nid, gfp_mask);
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		if (cpuset_zone_allowed(zone, htlb_alloc_mask(h))) {
+		if (cpuset_zone_allowed(zone, gfp_mask)) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
 				if (avoid_reserve)
@@ -1545,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 	do {
 		struct page *page;
 		struct mempolicy *mpol;
-		struct zonelist *zl;
+		int nid;
 		nodemask_t *nodemask;
 
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
+		nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
 		mpol_cond_put(mpol);
-		page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
+		page = __alloc_pages_nodemask(gfp, order, nid, nodemask);
 		if (page)
 			return page;
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 76d4745513ee..79787a55ebc1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1598,11 +1598,9 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		gfp_mask |= __GFP_HIGHMEM;
 
 	if (!nodes_empty(nmask))
-		new_page = __alloc_pages_nodemask(gfp_mask, 0,
-					node_zonelist(nid, gfp_mask), &nmask);
+		new_page = __alloc_pages_nodemask(gfp_mask, 0, nid, &nmask);
 	if (!new_page)
-		new_page = __alloc_pages(gfp_mask, 0,
-					node_zonelist(nid, gfp_mask));
+		new_page = __alloc_pages(gfp_mask, 0, nid);
 
 	return new_page;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index efeec8d2bce5..895d7a775f27 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1670,9 +1670,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	return NULL;
 }
 
-/* Return a zonelist indicated by gfp for node representing a mempolicy */
-static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
-	int nd)
+/* Return the node id preferred by the given mempolicy, or the given id */
+static int policy_node(gfp_t gfp, struct mempolicy *policy,
+								int nd)
 {
 	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
 		nd = policy->v.preferred_node;
@@ -1685,7 +1685,7 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
-	return node_zonelist(nd, gfp);
+	return nd;
 }
 
 /* Do dynamic interleaving for a process */
@@ -1793,38 +1793,37 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
 
 #ifdef CONFIG_HUGETLBFS
 /*
- * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
+ * huge_node(@vma, @addr, @gfp_flags, @mpol)
  * @vma: virtual memory area whose policy is sought
  * @addr: address in @vma for shared policy lookup and interleave policy
  * @gfp_flags: for requested zone
  * @mpol: pointer to mempolicy pointer for reference counted mempolicy
  * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
  *
- * Returns a zonelist suitable for a huge page allocation and a pointer
+ * Returns a nid suitable for a huge page allocation and a pointer
  * to the struct mempolicy for conditional unref after allocation.
  * If the effective policy is 'BIND, returns a pointer to the mempolicy's
  * @nodemask for filtering the zonelist.
  *
  * Must be protected by read_mems_allowed_begin()
  */
-struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
-				gfp_t gfp_flags, struct mempolicy **mpol,
-				nodemask_t **nodemask)
+int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
+				struct mempolicy **mpol, nodemask_t **nodemask)
 {
-	struct zonelist *zl;
+	int nid;
 
 	*mpol = get_vma_policy(vma, addr);
 	*nodemask = NULL;	/* assume !MPOL_BIND */
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
-		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
-				huge_page_shift(hstate_vma(vma))), gfp_flags);
+		nid = interleave_nid(*mpol, vma, addr,
+					huge_page_shift(hstate_vma(vma)));
 	} else {
-		zl = policy_zonelist(gfp_flags, *mpol, numa_node_id());
+		nid = policy_node(gfp_flags, *mpol, numa_node_id());
 		if ((*mpol)->mode == MPOL_BIND)
 			*nodemask = &(*mpol)->v.nodes;
 	}
-	return zl;
+	return nid;
 }
 
 /*
@@ -1926,12 +1925,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 					unsigned nid)
 {
-	struct zonelist *zl;
 	struct page *page;
 
-	zl = node_zonelist(nid, gfp);
-	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
+	page = __alloc_pages(gfp, order, nid);
+	if (page && page_to_nid(page) == nid)
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
 	return page;
 }
@@ -1965,8 +1962,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 {
 	struct mempolicy *pol;
 	struct page *page;
+	int preferred_nid;
 	unsigned int cpuset_mems_cookie;
-	struct zonelist *zl;
 	nodemask_t *nmask;
 
 retry_cpuset:
@@ -2009,8 +2006,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	}
 
 	nmask = policy_nodemask(gfp, pol);
-	zl = policy_zonelist(gfp, pol, node);
-	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
+	preferred_nid = policy_node(gfp, pol, node);
+	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
 	mpol_cond_put(pol);
 out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
@@ -2057,7 +2054,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	else
 		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
+				policy_node(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 502d82f0e004..028db51c953b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3969,12 +3969,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 }
 
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, nodemask_t *nodemask,
+		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_mask,
 		unsigned int *alloc_flags)
 {
 	ac->high_zoneidx = gfp_zone(gfp_mask);
-	ac->zonelist = zonelist;
+	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
 	ac->migratetype = gfpflags_to_migratetype(gfp_mask);
 
@@ -4019,8 +4019,8 @@ static inline void finalise_ac(gfp_t gfp_mask,
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
+							nodemask_t *nodemask)
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -4028,7 +4028,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct alloc_context ac = { };
 
 	gfp_mask &= gfp_allowed_mask;
-	if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags))
+	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
 
 	finalise_ac(gfp_mask, order, &ac);
-- 
2.12.2

* [RFC 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets
From: Vlastimil Babka @ 2017-04-11 14:06 UTC
  To: linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Vlastimil Babka

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") has introduced a two-step protocol when rebinding a task's
mempolicy due to a cpuset update, in order to avoid a parallel allocation
seeing an empty effective nodemask and failing. Later, commit cc9a6c877661
("cpuset: mm: reduce large amounts of memory barrier related damage v3")
introduced a seqlock protection and removed the synchronization point between
the two update steps. At that point (or perhaps later), the two-step rebinding
became unnecessary. Currently it only makes sure that the update first adds new
nodes in step 1 and then removes nodes in step 2. Without memory barriers the
effects are questionable, and even then this cannot prevent a parallel zonelist
iteration, checking the nodemask at each step, from observing all nodes as
unusable for allocation. We now fully rely on the seqlock to prevent premature
OOMs and allocation failures (see the write-side sketch below).

We can thus remove the two-step update parts and simplify the code.
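
For reference, the write side we now fully rely on lives in
cpuset_change_task_nodemask(). A simplified sketch of the relevant part as it
looks with this patch applied (patch 5 of this series then makes the seqlock
unconditional by removing the need_loop heuristic):

	if (need_loop) {
		local_irq_disable();
		write_seqcount_begin(&tsk->mems_allowed_seq);
	}

	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
	mpol_rebind_task(tsk, newmems);
	tsk->mems_allowed = *newmems;

	if (need_loop) {
		write_seqcount_end(&tsk->mems_allowed_seq);
		local_irq_enable();
	}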

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mempolicy.h      |   6 +--
 include/uapi/linux/mempolicy.h |   8 ----
 kernel/cgroup/cpuset.c         |   4 +-
 mm/mempolicy.c                 | 102 ++++++++---------------------------------
 4 files changed, 21 insertions(+), 99 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ecb6cbeede5a..3a58b4be1b0c 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -142,8 +142,7 @@ bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
-extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
-				enum mpol_rebind_step step);
+extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 
 extern int huge_node(struct vm_area_struct *vma,
@@ -260,8 +259,7 @@ static inline void numa_default_policy(void)
 }
 
 static inline void mpol_rebind_task(struct task_struct *tsk,
-				const nodemask_t *new,
-				enum mpol_rebind_step step)
+				const nodemask_t *new)
 {
 }
 
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 9cd8b21dddbe..2a4d89508fec 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -24,13 +24,6 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };
 
-enum mpol_rebind_step {
-	MPOL_REBIND_ONCE,	/* do rebind work at once(not by two step) */
-	MPOL_REBIND_STEP1,	/* first step(set all the newly nodes) */
-	MPOL_REBIND_STEP2,	/* second step(clean all the disallowed nodes)*/
-	MPOL_REBIND_NSTEP,
-};
-
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
@@ -65,7 +58,6 @@ enum mpol_rebind_step {
  */
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
-#define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f6501f4f6040..b0159f8f8c89 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1063,9 +1063,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	}
 
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
-	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
-
-	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
+	mpol_rebind_task(tsk, newmems);
 	tsk->mems_allowed = *newmems;
 
 	if (need_loop) {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 895d7a775f27..72e5aeb1feeb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -146,22 +146,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
 
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
-	/*
-	 * If read-side task has no lock to protect task->mempolicy, write-side
-	 * task will rebind the task->mempolicy by two step. The first step is
-	 * setting all the newly nodes, and the second step is cleaning all the
-	 * disallowed nodes. In this way, we can avoid finding no node to alloc
-	 * page.
-	 * If we have a lock to protect task->mempolicy in read-side, we do
-	 * rebind directly.
-	 *
-	 * step:
-	 * 	MPOL_REBIND_ONCE - do rebind work at once
-	 * 	MPOL_REBIND_STEP1 - set all the newly nodes
-	 * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
-	 */
-	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes,
-			enum mpol_rebind_step step);
+	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
 } mpol_ops[MPOL_MAX];
 
 static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
@@ -304,19 +289,11 @@ void __mpol_put(struct mempolicy *p)
 	kmem_cache_free(policy_cache, p);
 }
 
-static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes,
-				enum mpol_rebind_step step)
+static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
 {
 }
 
-/*
- * step:
- * 	MPOL_REBIND_ONCE  - do rebind work at once
- * 	MPOL_REBIND_STEP1 - set all the newly nodes
- * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
- */
-static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
-				 enum mpol_rebind_step step)
+static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	nodemask_t tmp;
 
@@ -325,35 +302,19 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
 	else if (pol->flags & MPOL_F_RELATIVE_NODES)
 		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
 	else {
-		/*
-		 * if step == 1, we use ->w.cpuset_mems_allowed to cache the
-		 * result
-		 */
-		if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP1) {
-			nodes_remap(tmp, pol->v.nodes,
-					pol->w.cpuset_mems_allowed, *nodes);
-			pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
-		} else if (step == MPOL_REBIND_STEP2) {
-			tmp = pol->w.cpuset_mems_allowed;
-			pol->w.cpuset_mems_allowed = *nodes;
-		} else
-			BUG();
+		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
+								*nodes);
+		pol->w.cpuset_mems_allowed = tmp;
 	}
 
 	if (nodes_empty(tmp))
 		tmp = *nodes;
 
-	if (step == MPOL_REBIND_STEP1)
-		nodes_or(pol->v.nodes, pol->v.nodes, tmp);
-	else if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP2)
-		pol->v.nodes = tmp;
-	else
-		BUG();
+	pol->v.nodes = tmp;
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
-				  const nodemask_t *nodes,
-				  enum mpol_rebind_step step)
+						const nodemask_t *nodes)
 {
 	nodemask_t tmp;
 
@@ -379,42 +340,19 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
 /*
  * mpol_rebind_policy - Migrate a policy to a different set of nodes
  *
- * If read-side task has no lock to protect task->mempolicy, write-side
- * task will rebind the task->mempolicy by two step. The first step is
- * setting all the newly nodes, and the second step is cleaning all the
- * disallowed nodes. In this way, we can avoid finding no node to alloc
- * page.
- * If we have a lock to protect task->mempolicy in read-side, we do
- * rebind directly.
- *
- * step:
- * 	MPOL_REBIND_ONCE  - do rebind work at once
- * 	MPOL_REBIND_STEP1 - set all the newly nodes
- * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
+ * Per-vma policies are protected by mmap_sem. Allocations using per-task
+ * policies are protected by task->mems_allowed_seq to prevent a premature
+ * OOM/allocation failure due to parallel nodemask modification.
  */
-static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
-				enum mpol_rebind_step step)
+static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
 {
 	if (!pol)
 		return;
-	if (!mpol_store_user_nodemask(pol) && step == MPOL_REBIND_ONCE &&
+	if (!mpol_store_user_nodemask(pol) &&
 	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
 		return;
 
-	if (step == MPOL_REBIND_STEP1 && (pol->flags & MPOL_F_REBINDING))
-		return;
-
-	if (step == MPOL_REBIND_STEP2 && !(pol->flags & MPOL_F_REBINDING))
-		BUG();
-
-	if (step == MPOL_REBIND_STEP1)
-		pol->flags |= MPOL_F_REBINDING;
-	else if (step == MPOL_REBIND_STEP2)
-		pol->flags &= ~MPOL_F_REBINDING;
-	else if (step >= MPOL_REBIND_NSTEP)
-		BUG();
-
-	mpol_ops[pol->mode].rebind(pol, newmask, step);
+	mpol_ops[pol->mode].rebind(pol, newmask);
 }
 
 /*
@@ -424,10 +362,9 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
  * Called with task's alloc_lock held.
  */
 
-void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
-			enum mpol_rebind_step step)
+void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
 {
-	mpol_rebind_policy(tsk->mempolicy, new, step);
+	mpol_rebind_policy(tsk->mempolicy, new);
 }
 
 /*
@@ -442,7 +379,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 
 	down_write(&mm->mmap_sem);
 	for (vma = mm->mmap; vma; vma = vma->vm_next)
-		mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
+		mpol_rebind_policy(vma->vm_policy, new);
 	up_write(&mm->mmap_sem);
 }
 
@@ -2103,10 +2040,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
 
 	if (current_cpuset_is_being_rebound()) {
 		nodemask_t mems = cpuset_mems_allowed(current);
-		if (new->flags & MPOL_F_REBINDING)
-			mpol_rebind_policy(new, &mems, MPOL_REBIND_STEP2);
-		else
-			mpol_rebind_policy(new, &mems, MPOL_REBIND_ONCE);
+		mpol_rebind_policy(new, &mems);
 	}
 	atomic_set(&new->refcnt, 1);
 	return new;
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC 5/6] mm, cpuset: always use seqlock when changing task's nodemask
  2017-04-11 14:06 ` Vlastimil Babka
@ 2017-04-11 14:06   ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-11 14:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Vlastimil Babka

When updating a task's mems_allowed and rebinding its mempolicy due to the
cpuset's mems being changed, we currently take the seqlock for writing only
when either the task has a mempolicy, or the new mems have no intersection
with the old mems. This should be enough to prevent a parallel allocation from
seeing no available nodes, but the optimization is IMHO unnecessary (cpuset
updates should not be frequent), and we still potentially risk issues if the
intersection of the new and old nodes has only a limited amount of
free/reclaimable memory. Let's just use the seqlock for all tasks.
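
For reference, a minimal sketch of the read-side pattern that pairs with this
write section (the real readers are the page allocator paths; the allocation
attempt below is just an illustrative placeholder):

	unsigned int cpuset_mems_cookie;
	struct page *page;

	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		page = attempt_allocation();	/* placeholder */
	} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));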

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/cgroup/cpuset.c | 29 +++++++----------------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b0159f8f8c89..e76d18daf085 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1038,38 +1038,23 @@ static void cpuset_post_attach(void)
  * @tsk: the task to change
  * @newmems: new nodes that the task will be set
  *
- * In order to avoid seeing no nodes if the old and new nodes are disjoint,
- * we structure updates as setting all new allowed nodes, then clearing newly
- * disallowed ones.
+ * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
+ * and rebind the task's mempolicy, if any. If the task is allocating in
+ * parallel, it might temporarily see an empty intersection, which results in
+ * a seqlock check and retry before OOM or allocation failure.
  */
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
-	bool need_loop;
-
 	task_lock(tsk);
-	/*
-	 * Determine if a loop is necessary if another thread is doing
-	 * read_mems_allowed_begin().  If at least one node remains unchanged and
-	 * tsk does not have a mempolicy, then an empty nodemask will not be
-	 * possible when mems_allowed is larger than a word.
-	 */
-	need_loop = task_has_mempolicy(tsk) ||
-			!nodes_intersects(*newmems, tsk->mems_allowed);
 
-	if (need_loop) {
-		local_irq_disable();
-		write_seqcount_begin(&tsk->mems_allowed_seq);
-	}
+	local_irq_disable();
+	write_seqcount_begin(&tsk->mems_allowed_seq);
 
-	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems);
 	tsk->mems_allowed = *newmems;
 
-	if (need_loop) {
-		write_seqcount_end(&tsk->mems_allowed_seq);
-		local_irq_enable();
-	}
+	write_seqcount_end(&tsk->mems_allowed_seq);
 
 	task_unlock(tsk);
 }
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* [RFC 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter
  2017-04-11 14:06 ` Vlastimil Babka
@ 2017-04-11 14:06   ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-11 14:06 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Vlastimil Babka

Two wrappers of __alloc_pages_nodemask() check task->mems_allowed_seq
themselves to retry an allocation that has raced with a cpuset update. This
has been shown to be ineffective in preventing premature OOMs, which can
happen in __alloc_pages_slowpath() long before it returns to the wrappers to
detect the race at that level. Previous patches have made
__alloc_pages_slowpath() more robust, so we can now simply remove the seqlock
checking in the wrappers, to avoid giving the wrong impression that it can
actually help.
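
Conceptually, the slowpath now performs the equivalent check itself right
before it would fail or OOM, roughly like this (a sketch; it assumes the
cookie was taken at slowpath entry, mirroring the wrappers removed below):

	/* at the end of __alloc_pages_slowpath(), before giving up: */
	if (read_mems_allowed_retry(cpuset_mems_cookie))
		goto retry_cpuset;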

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mempolicy.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 72e5aeb1feeb..9a542b7a2189 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1900,12 +1900,9 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	struct mempolicy *pol;
 	struct page *page;
 	int preferred_nid;
-	unsigned int cpuset_mems_cookie;
 	nodemask_t *nmask;
 
-retry_cpuset:
 	pol = get_vma_policy(vma, addr);
-	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	if (pol->mode == MPOL_INTERLEAVE) {
 		unsigned nid;
@@ -1947,8 +1944,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
 	mpol_cond_put(pol);
 out:
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-		goto retry_cpuset;
 	return page;
 }
 
@@ -1966,23 +1961,15 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
  *	Allocate a page from the kernel page pool.  When not in
  *	interrupt context and apply the current process NUMA policy.
  *	Returns NULL when no page can be allocated.
- *
- *	Don't call cpuset_update_task_memory_state() unless
- *	1) it's ok to take cpuset_sem (can WAIT), and
- *	2) allocating for current task (not interrupt).
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = &default_policy;
 	struct page *page;
-	unsigned int cpuset_mems_cookie;
 
 	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
 		pol = get_task_policy(current);
 
-retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
-
 	/*
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
@@ -1994,9 +1981,6 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 				policy_node(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-		goto retry_cpuset;
-
 	return page;
 }
 EXPORT_SYMBOL(alloc_pages_current);
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-11 14:06   ` Vlastimil Babka
@ 2017-04-11 17:24     ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-04-11 17:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov

On Tue, 11 Apr 2017, Vlastimil Babka wrote:

> The root of the problem is that the cpuset's mems_allowed and mempolicy's
> nodemask can temporarily have no intersection, thus get_page_from_freelist()
> cannot find any usable zone. The current semantic for empty intersection is to
> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
> node_zonelist(), but the racy update can happen after we already passed the

The fallback was only intended for a cpuset whose boundaries are not enforced
in critical conditions (softwall). A hardwall cpuset (CS_MEM_HARDWALL)
should fail the allocation.

> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
> node_zonelist(). This works fine, because almost all callers of

Well, that would need to be subject to the hardwall flag. Allocation needs
to fail for a hardwall cpuset.
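
For context, the softwall/hardwall distinction is applied per zone during the
zonelist walk, roughly like this (a sketch; see get_page_from_freelist() and
__cpuset_zone_allowed() for the real logic):

	for_each_zone_zonelist_nodemask(zone, z, zonelist, highidx, nodemask) {
		/*
		 * __GFP_HARDWALL in gfp_mask enforces the current cpuset;
		 * without it, the allocation may escape to the nearest
		 * hardwall (CS_MEM_HARDWALL) ancestor.
		 */
		if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
		    !__cpuset_zone_allowed(zone, gfp_mask))
			continue;
		/* ... try this zone ... */
	}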

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  2017-04-11 14:06   ` Vlastimil Babka
@ 2017-04-11 17:32     ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-04-11 17:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov

On Tue, 11 Apr 2017, Vlastimil Babka wrote:

> The task->il_next variable remembers the last allocation node for task's
> MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
> bind mempolicies due to changing cpuset mems. Currently it also tries to
> make sure that current->il_next is valid within the updated nodemask. This is
> bogus, because 1) we are updating potentially any task's mempolicy, not just
> current, and 2) we might be updating per-vma mempolicy, not task one.
>
> The interleave_nodes() function that uses il_next can cope fine with the value
> not being within the currently allowed nodes, so this hasn't manifested as an
> actual issue. Thus it also won't be an issue if we just remove this adjustment
> completely.

Well, interleave_nodes() will then potentially return a node outside of
the allowed memory policy when it's called for the first time after
mpol_rebind_...(). But then it will find the next node within the
nodemask and work correctly for the next invocations.
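
For reference, this is the current interleave_nodes() body - the first call
after a rebind hands out the possibly stale nid once:

	nid = me->il_next;
	next = next_node_in(nid, policy->v.nodes);
	if (next < MAX_NUMNODES)
		me->il_next = next;
	return nid;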

But yeah, the race can probably be ignored. The idea was that the
application has a stable memory footprint during rebinding.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-04-11 19:00       ` Vlastimil Babka
  0 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-11 19:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

+CC linux-api

On 11.4.2017 19:24, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Vlastimil Babka wrote:
> 
>> The root of the problem is that the cpuset's mems_allowed and mempolicy's
>> nodemask can temporarily have no intersection, thus get_page_from_freelist()
>> cannot find any usable zone. The current semantic for empty intersection is to
>> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
>> node_zonelist(), but the racy update can happen after we already passed the
> 
> The fallback was only intended for a cpuset whose boundaries are not enforced
> in critical conditions (softwall). A hardwall cpuset (CS_MEM_HARDWALL)
> should fail the allocation.

Hmm, just to clarify - I'm talking about ignoring the *mempolicy's* nodemask on
the basis of the cpuset having higher priority, while you seem to be talking
about ignoring a (softwall) cpuset nodemask, right? man set_mempolicy says
"... if required nodemask contains no nodes that are allowed by the process's
current cpuset context, the memory policy reverts to local allocation", which
does come down to ignoring the mempolicy's nodemask.
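
That fallback is implemented along these lines (a sketch in the spirit of
policy_nodemask(); cpuset_nodemask_valid_mems_allowed() is the real helper,
the surrounding shape is illustrative):

	/*
	 * MPOL_BIND: honour the policy nodemask only while it still
	 * intersects current->mems_allowed, else revert to local.
	 */
	if (policy->mode == MPOL_BIND &&
	    cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
		return &policy->v.nodes;
	return NULL;	/* i.e. the mempolicy's nodemask gets ignored */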

>> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
>> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
>> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
>> node_zonelist(). This works fine, because almost all callers of
> 
> Well, that would need to be subject to the hardwall flag. Allocation needs
> to fail for a hardwall cpuset.

They still do, if no hardwall cpuset node can satisfy the allocation with
mempolicy ignored.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  2017-04-11 17:32     ` Christoph Lameter
  (?)
@ 2017-04-11 19:03       ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-11 19:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov

On 11.4.2017 19:32, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Vlastimil Babka wrote:
> 
>> The task->il_next variable remembers the last allocation node for task's
>> MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
>> bind mempolicies due to changing cpuset mems. Currently it also tries to
>> make sure that current->il_next is valid within the updated nodemask. This is
>> bogus, because 1) we are updating potentially any task's mempolicy, not just
>> current, and 2) we might be updating per-vma mempolicy, not task one.
>>
>> The interleave_nodes() function that uses il_next can cope fine with the value
>> not being within the currently allowed nodes, so this hasn't manifested as an
>> actual issue. Thus it also won't be an issue if we just remove this adjustment
>> completely.
> 
> Well, interleave_nodes() will then potentially return a node outside of
> the allowed memory policy when it's called for the first time after
> mpol_rebind_...(). But then it will find the next node within the
> nodemask and work correctly for the next invocations.

Hmm, you're right. But that could be easily fixed if il_next became il_prev, so
we would return the result of next_node_in(il_prev) and also store it as the new
il_prev, right? I somehow assumed it already worked that way.

> But yeah, the race can probably be ignored. The idea was that the
> application has a stable memory footprint during rebinding.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 5/6] mm, cpuset: always use seqlock when changing task's nodemask
  2017-04-11 14:06   ` Vlastimil Babka
@ 2017-04-12  8:10     ` Hillf Danton
  -1 siblings, 0 replies; 91+ messages in thread
From: Hillf Danton @ 2017-04-12  8:10 UTC (permalink / raw)
  To: 'Vlastimil Babka', linux-mm
  Cc: linux-kernel, cgroups, 'Li Zefan', 'Michal Hocko',
	'Mel Gorman', 'David Rientjes',
	'Christoph Lameter', 'Hugh Dickins',
	'Andrea Arcangeli', 'Anshuman Khandual',
	'Kirill A. Shutemov'

On April 11, 2017 10:06 PM Vlastimil Babka wrote: 
> 
>  static void cpuset_change_task_nodemask(struct task_struct *tsk,
>  					nodemask_t *newmems)
>  {
> -	bool need_loop;
> -
>  	task_lock(tsk);
> -	/*
> -	 * Determine if a loop is necessary if another thread is doing
> -	 * read_mems_allowed_begin().  If at least one node remains unchanged and
> -	 * tsk does not have a mempolicy, then an empty nodemask will not be
> -	 * possible when mems_allowed is larger than a word.
> -	 */
> -	need_loop = task_has_mempolicy(tsk) ||
> -			!nodes_intersects(*newmems, tsk->mems_allowed);
> 
> -	if (need_loop) {
> -		local_irq_disable();
> -		write_seqcount_begin(&tsk->mems_allowed_seq);
> -	}
> +	local_irq_disable();
> +	write_seqcount_begin(&tsk->mems_allowed_seq);
> 
> -	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>  	mpol_rebind_task(tsk, newmems);
>  	tsk->mems_allowed = *newmems;
> 
> -	if (need_loop) {
> -		write_seqcount_end(&tsk->mems_allowed_seq);
> -		local_irq_enable();
> -	}
> +	write_seqcount_end(&tsk->mems_allowed_seq);
> 
Doubt if we'd ever enable IRQs again.

>  	task_unlock(tsk);
>  }
> --
> 2.12.2

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 5/6] mm, cpuset: always use seqlock when changing task's nodemask
  2017-04-12  8:10     ` Hillf Danton
  (?)
@ 2017-04-12  8:18       ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-12  8:18 UTC (permalink / raw)
  To: Hillf Danton, linux-mm
  Cc: linux-kernel, cgroups, 'Li Zefan', 'Michal Hocko',
	'Mel Gorman', 'David Rientjes',
	'Christoph Lameter', 'Hugh Dickins',
	'Andrea Arcangeli', 'Anshuman Khandual',
	'Kirill A. Shutemov'

On 04/12/2017 10:10 AM, Hillf Danton wrote:
> On April 11, 2017 10:06 PM Vlastimil Babka wrote: 
>>
>>  static void cpuset_change_task_nodemask(struct task_struct *tsk,
>>  					nodemask_t *newmems)
>>  {
>> -	bool need_loop;
>> -
>>  	task_lock(tsk);
>> -	/*
>> -	 * Determine if a loop is necessary if another thread is doing
>> -	 * read_mems_allowed_begin().  If at least one node remains unchanged and
>> -	 * tsk does not have a mempolicy, then an empty nodemask will not be
>> -	 * possible when mems_allowed is larger than a word.
>> -	 */
>> -	need_loop = task_has_mempolicy(tsk) ||
>> -			!nodes_intersects(*newmems, tsk->mems_allowed);
>>
>> -	if (need_loop) {
>> -		local_irq_disable();
>> -		write_seqcount_begin(&tsk->mems_allowed_seq);
>> -	}
>> +	local_irq_disable();
>> +	write_seqcount_begin(&tsk->mems_allowed_seq);
>>
>> -	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>>  	mpol_rebind_task(tsk, newmems);
>>  	tsk->mems_allowed = *newmems;
>>
>> -	if (need_loop) {
>> -		write_seqcount_end(&tsk->mems_allowed_seq);
>> -		local_irq_enable();
>> -	}
>> +	write_seqcount_end(&tsk->mems_allowed_seq);
>>
> Doubt if we'd ever enable IRQs again.

Ugh, thanks for catching this. Looks like my testing config didn't have
lockup detectors enabled.
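
So the write section needs a matching local_irq_enable() after
write_seqcount_end(); the intended shape for the next version (sketch):

	local_irq_disable();
	write_seqcount_begin(&tsk->mems_allowed_seq);

	mpol_rebind_task(tsk, newmems);
	tsk->mems_allowed = *newmems;

	write_seqcount_end(&tsk->mems_allowed_seq);
	local_irq_enable();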

>>  	task_unlock(tsk);
>>  }
>> --
>> 2.12.2
> 

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  2017-04-11 19:03       ` Vlastimil Babka
@ 2017-04-12  8:49         ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-12  8:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov

On 04/11/2017 09:03 PM, Vlastimil Babka wrote:
> On 11.4.2017 19:32, Christoph Lameter wrote:
>> On Tue, 11 Apr 2017, Vlastimil Babka wrote:
>>
>>> The task->il_next variable remembers the last allocation node for task's
>>> MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
>>> bind mempolicies due to changing cpuset mems. Currently it also tries to
>>> make sure that current->il_next is valid within the updated nodemask. This is
>>> bogus, because 1) we are updating potentially any task's mempolicy, not just
>>> current, and 2) we might be updating per-vma mempolicy, not task one.
>>>
>>> The interleave_nodes() function that uses il_next can cope fine with the value
>>> not being within the currently allowed nodes, so this hasn't manifested as an
>>> actual issue. Thus it also won't be an issue if we just remove this adjustment
>>> completely.
>>
>> Well, interleave_nodes() will then potentially return a node outside of
>> the allowed memory policy when it's called for the first time after
>> mpol_rebind_...(). But then it will find the next node within the
>> nodemask and work correctly for the next invocations.
> 
> Hmm, you're right. But that could be easily fixed if il_next became il_prev, so
> we would return the result of next_node_in(il_prev) and also store it as the new
> il_prev, right? I somehow assumed it already worked that way.

Like this?
----8<----
commit 0ec64a0b8e614ea655328d0fb539447c407ba7c1
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Mon Apr 3 13:11:32 2017 +0200

    mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
    
    The task->il_next variable stores the next allocation node id for task's
    MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
    bind mempolicies due to changing cpuset mems. Currently it also tries to
    make sure that current->il_next is valid within the updated nodemask. This is
    bogus, because 1) we are updating potentially any task's mempolicy, not just
    current, and 2) we might be updating a per-vma mempolicy, not task one.
    
    The interleave_nodes() function that uses il_next can cope fine with the value
    not being within the currently allowed nodes, so this hasn't manifested as an
    actual issue.
    
    We can remove the need for updating il_next completely by changing it to
    il_prev and storing the node id of the previous interleave allocation instead
    of the next id. Then interleave_nodes() can calculate the next id using the
    current nodemask and also store it as il_prev, except when merely querying
    the next node via do_get_mempolicy().
    
    Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 050d7113924a..9aca0db1e588 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -886,7 +886,7 @@ struct task_struct {
 #ifdef CONFIG_NUMA
 	/* Protected by alloc_lock: */
 	struct mempolicy		*mempolicy;
-	short				il_next;
+	short				il_prev;
 	short				pref_node_fork;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b334bfe9..25f9bde58521 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -349,12 +349,6 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
 		pol->v.nodes = tmp;
 	else
 		BUG();
-
-	if (!node_isset(current->il_next, tmp)) {
-		current->il_next = next_node_in(current->il_next, tmp);
-		if (current->il_next >= MAX_NUMNODES)
-			current->il_next = numa_node_id();
-	}
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -812,9 +806,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	}
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE &&
-	    nodes_weight(new->v.nodes))
-		current->il_next = first_node(new->v.nodes);
+	if (new && new->mode == MPOL_INTERLEAVE)
+		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -863,6 +856,18 @@ static int lookup_node(unsigned long addr)
 	return err;
 }
 
+/* Do dynamic interleaving for a process */
+static unsigned interleave_nodes(struct mempolicy *policy, bool update_prev)
+{
+	unsigned next;
+	struct task_struct *me = current;
+
+	next = next_node_in(me->il_prev, policy->v.nodes);
+	if (next < MAX_NUMNODES && update_prev)
+		me->il_prev = next;
+	return next;
+}
+
 /* Retrieve NUMA policy */
 static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			     unsigned long addr, unsigned long flags)
@@ -916,7 +921,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			*policy = err;
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
-			*policy = current->il_next;
+			*policy = interleave_nodes(current->mempolicy, false);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1694,19 +1699,6 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 	return node_zonelist(nd, gfp);
 }
 
-/* Do dynamic interleaving for a process */
-static unsigned interleave_nodes(struct mempolicy *policy)
-{
-	unsigned nid, next;
-	struct task_struct *me = current;
-
-	nid = me->il_next;
-	next = next_node_in(nid, policy->v.nodes);
-	if (next < MAX_NUMNODES)
-		me->il_next = next;
-	return nid;
-}
-
 /*
  * Depending on the memory policy provide a node from which to allocate the
  * next slab entry.
@@ -1731,7 +1723,7 @@ unsigned int mempolicy_slab_node(void)
 		return policy->v.preferred_node;
 
 	case MPOL_INTERLEAVE:
-		return interleave_nodes(policy);
+		return interleave_nodes(policy, true);
 
 	case MPOL_BIND: {
 		struct zoneref *z;
@@ -1794,7 +1786,7 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
 		off += (addr - vma->vm_start) >> shift;
 		return offset_il_node(pol, vma, off);
 	} else
-		return interleave_nodes(pol);
+		return interleave_nodes(pol, true);
 }
 
 #ifdef CONFIG_HUGETLBFS
@@ -2060,7 +2052,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	 * nor system default_policy
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
-		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
+		page = alloc_page_interleave(gfp, order,
+				interleave_nodes(pol, true));
 	else
 		page = __alloc_pages_nodemask(gfp, order,
 				policy_zonelist(gfp, pol, numa_node_id()),

^ permalink raw reply related	[flat|nested] 91+ messages in thread

* Re: [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  2017-04-12  8:49         ` Vlastimil Babka
  (?)
@ 2017-04-12 21:16           ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-04-12 21:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov

On Wed, 12 Apr 2017, Vlastimil Babka wrote:

> >> Well, interleave_nodes() will then potentially return a node outside of
> >> the allowed memory policy when it's called for the first time after
> >> mpol_rebind_.. . But then it will find the next node within the
> >> nodemask and work correctly for the next invocations.
> >
> > Hmm, you're right. But that could be easily fixed if il_next became il_prev, so
> > we would return the result of next_node_in(il_prev) and also store it as the new
> > il_prev, right? I somehow assumed it already worked that way.

Yup that makes sense and I thought about that when I saw the problem too.

> @@ -863,6 +856,18 @@ static int lookup_node(unsigned long addr)
>  	return err;
>  }
>
> +/* Do dynamic interleaving for a process */
> +static unsigned interleave_nodes(struct mempolicy *policy, bool update_prev)

Why do you need an additional flag? Would it not be better to always
update and switch the update_prev=false case to simply use
next_node_in()?

> +{
> +	unsigned next;
> +	struct task_struct *me = current;
> +
> +	next = next_node_in(me->il_prev, policy->v.nodes);
> +	if (next < MAX_NUMNODES && update_prev)
> +		me->il_prev = next;
> +	return next;
> +}
> +
>  /* Retrieve NUMA policy */
>  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  			     unsigned long addr, unsigned long flags)
> @@ -916,7 +921,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  			*policy = err;
>  		} else if (pol == current->mempolicy &&
>  				pol->mode == MPOL_INTERLEAVE) {
> -			*policy = current->il_next;
> +			*policy = interleave_nodes(current->mempolicy, false);

Here
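
I.e., roughly this sketch of the suggested simplification (not the posted
patch; interleave_nodes() would then always advance il_prev, and the
do_get_mempolicy() spot above would peek with next_node_in() directly):

/* Do dynamic interleaving for a process */
static unsigned interleave_nodes(struct mempolicy *policy)
{
	unsigned next;

	next = next_node_in(current->il_prev, policy->v.nodes);
	if (next < MAX_NUMNODES)
		current->il_prev = next;
	return next;
}

and in do_get_mempolicy():

			/* peek at the next node without advancing il_prev */
			*policy = next_node_in(current->il_prev, pol->v.nodes);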

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  2017-04-12 21:16           ` Christoph Lameter
@ 2017-04-12 21:18             ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-12 21:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov

On 12.4.2017 23:16, Christoph Lameter wrote:
> On Wed, 12 Apr 2017, Vlastimil Babka wrote:
> 
>>>> Well, interleave_nodes() will then potentially return a node outside of
>>>> the allowed memory policy when it's called for the first time after
>>>> mpol_rebind_.. . But then it will find the next node within the
>>>> nodemask and work correctly for the next invocations.
>>>
>>> Hmm, you're right. But that could be easily fixed if il_next became il_prev, so
>>> we would return the result of next_node_in(il_prev) and also store it as the new
>>> il_prev, right? I somehow assumed it already worked that way.
> 
> Yup that makes sense and I thought about that when I saw the problem too.
> 
>> @@ -863,6 +856,18 @@ static int lookup_node(unsigned long addr)
>>  	return err;
>>  }
>>
>> +/* Do dynamic interleaving for a process */
>> +static unsigned interleave_nodes(struct mempolicy *policy, bool update_prev)
> 
> Why do you need an additional flag? Would it not be better to always
> update and switch the update_prev=false case to simply use
> next_node_in()?

Looked to me like better wrapping, but probably overengineered, OK. Will
change it for v2.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-11 19:00       ` Vlastimil Babka
@ 2017-04-12 21:25         ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-04-12 21:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Tue, 11 Apr 2017, Vlastimil Babka wrote:

> > The fallback was only intended for a cpuset on which boundaries are not enforced
> > in critical conditions (softwall). A hardwall cpuset (CS_MEM_HARDWALL)
> > should fail the allocation.
>
> Hmm just to clarify - I'm talking about ignoring the *mempolicy's* nodemask on
> the basis of cpuset having higher priority, while you seem to be talking about
> ignoring a (softwall) cpuset nodemask, right? man set_mempolicy says "... if
> required nodemask contains no nodes that are allowed by the process's current
> cpuset context, the memory policy reverts to local allocation" which does come
> down to ignoring mempolicy's nodemask.

I am talking about allocating outside of the currently allowed nodes
(determined by mempolicy -- MPOL_BIND is the only concern as far as I can
tell -- as well as the current cpuset). One can violate the cpuset if it's not
a hardwall, but the MPOL_BIND node restriction cannot be violated.

Those allocations are also not allowed if the allocation was for a user
space page even if this is a softwall cpuset.

> >> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
> >> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
> >> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
> >> node_zonelist(). This works fine, because almost all callers of
> >
> > Well that would need to be subject to the hardwall flag. Allocation needs
> > to fail for a hardwall cpuset.
>
> They still do, if no hardwall cpuset node can satisfy the allocation with
> mempolicy ignored.

If the memory policy is MPOL_BIND then allocations outside of the given
nodes should fail. They can violate the cpuset boundaries only if they are
kernel allocations and we are not in a hardwall cpuset.

That was at least my understanding when working on this code years ago.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-11 14:06   ` Vlastimil Babka
@ 2017-04-13  5:42     ` Anshuman Khandual
  -1 siblings, 0 replies; 91+ messages in thread
From: Anshuman Khandual @ 2017-04-13  5:42 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov

On 04/11/2017 07:36 PM, Vlastimil Babka wrote:
> Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset
> mems update") has fixed known recent regressions found by LTP's cpuset01
> testcase. I have however found that by modifying the testcase to use per-vma
> mempolicies via bind(2) instead of per-task mempolicies via set_mempolicy(2),
> the premature OOM still happens and the issue is much older.

Meanwhile, while we are discussing this RFC, would it be better to WARN
about these situations where we don't have a node in the intersection,
and hence no usable zone during allocation? That might actually give
the user a hint before a premature OOM/allocation failure comes.

> 
> The root of the problem is that the cpuset's mems_allowed and mempolicy's
> nodemask can temporarily have no intersection, thus get_page_from_freelist()
> cannot find any usable zone. The current semantic for empty intersection is to
> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
> node_zonelist(), but the racy update can happen after we already passed the
> check. Such races should be protected by the seqlock task->mems_allowed_seq,
> but it doesn't work here, because 1) mpol_rebind_mm() does not happen under
> seqlock for write, and doing so would lead to deadlock, as it takes mmap_sem
> for write, while the allocation can have mmap_sem for read when it's taking the
> seqlock for read. And 2) the seqlock cookie of callers of node_zonelist()
> (alloc_pages_vma() and alloc_pages_current()) is different than the one of
> __alloc_pages_slowpath(), so there's still a potential race window.
> 
> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
> node_zonelist(). This works fine, because almost all callers of
> __alloc_pages_nodemask are obtaining the nodemask via node_zonelist(). The only
> exception is new_node_page() from hotplug, where the potential violation of
> nodemask isn't an issue, as there's already a fallback allocation attempt
> without any nodemask. If there's a future caller that needs to have its specific
> nodemask honoured over task's cpuset restrictions, we'll have to e.g. add a gfp
> flag for that.

Did you really mean node_zonelist() in both instances above? Because
that function just picks either ZONELIST_FALLBACK or ZONELIST_NOFALLBACK
depending on the GFP flags passed in the allocation request, and does not
deal with ignoring the passed nodemask.
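
For reference, node_zonelist() around this kernel version is roughly the
following (include/linux/gfp.h); it only selects one of a node's two
zonelists based on __GFP_THISNODE:

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}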

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-13  5:42     ` Anshuman Khandual
@ 2017-04-13  6:06       ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-13  6:06 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov

On 04/13/2017 07:42 AM, Anshuman Khandual wrote:
> On 04/11/2017 07:36 PM, Vlastimil Babka wrote:
>> Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset
>> mems update") has fixed known recent regressions found by LTP's cpuset01
>> testcase. I have however found that by modifying the testcase to use per-vma
>> mempolicies via bind(2) instead of per-task mempolicies via set_mempolicy(2),
>> the premature OOM still happens and the issue is much older.
> 
> Meanwhile, while we are discussing this RFC, would it be better to WARN
> about these situations where we don't have a node in the intersection,
> and hence no usable zone during allocation? That might actually give
> the user a hint before a premature OOM/allocation failure comes.

Well, the bug is very old and nobody reported it so far, AFAIK. So it's
not that urgent.

>>
>> The root of the problem is that the cpuset's mems_allowed and mempolicy's
>> nodemask can temporarily have no intersection, thus get_page_from_freelist()
>> cannot find any usable zone. The current semantic for empty intersection is to
>> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
>> node_zonelist(), but the racy update can happen after we already passed the
>> check. Such races should be protected by the seqlock task->mems_allowed_seq,
>> but it doesn't work here, because 1) mpol_rebind_mm() does not happen under
>> seqlock for write, and doing so would lead to deadlock, as it takes mmap_sem
>> for write, while the allocation can have mmap_sem for read when it's taking the
>> seqlock for read. And 2) the seqlock cookie of callers of node_zonelist()
>> (alloc_pages_vma() and alloc_pages_current()) is different than the one of
>> __alloc_pages_slowpath(), so there's still a potential race window.
>>
>> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
>> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
>> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
>> node_zonelist(). This works fine, because almost all callers of
>> __alloc_pages_nodemask are obtaining the nodemask via node_zonelist(). The only
>> exception is new_node_page() from hotplug, where the potential violation of
>> nodemask isn't an issue, as there's already a fallback allocation attempt
>> without any nodemask. If there's a future caller that needs to have its specific
>> nodemask honoured over task's cpuset restrictions, we'll have to e.g. add a gfp
>> flag for that.
> 
> Did you really mean node_zonelist() in both instances above? Because
> that function just picks either ZONELIST_FALLBACK or ZONELIST_NOFALLBACK
> depending on the GFP flags passed in the allocation request, and does not
> deal with ignoring the passed nodemask.

Oops, I meant policy_zonelist(), thanks for noticing.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-13  6:06       ` Vlastimil Babka
@ 2017-04-13  6:07         ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-13  6:07 UTC (permalink / raw)
  To: Anshuman Khandual, linux-mm
  Cc: linux-kernel, cgroups, Li Zefan, Michal Hocko, Mel Gorman,
	David Rientjes, Christoph Lameter, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov

On 04/13/2017 08:06 AM, Vlastimil Babka wrote:
>> Did you really mean node_zonelist() in both instances above? Because
>> that function just picks either ZONELIST_FALLBACK or ZONELIST_NOFALLBACK
>> depending on the GFP flags passed in the allocation request, and does not
>> deal with ignoring the passed nodemask.
> 
> Oops, I meant policy_zonelist(), thanks for noticing.

Nah, policy_nodemask()... I need coffee.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-04-13  6:24           ` Vlastimil Babka
  0 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-13  6:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On 04/12/2017 11:25 PM, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Vlastimil Babka wrote:
> 
>>> The fallback was only intended for a cpuset on which boundaries are not enforced
>>> in critical conditions (softwall). A hardwall cpuset (CS_MEM_HARDWALL)
>>> should fail the allocation.
>>
>> Hmm just to clarify - I'm talking about ignoring the *mempolicy's* nodemask on
>> the basis of cpuset having higher priority, while you seem to be talking about
>> ignoring a (softwall) cpuset nodemask, right? man set_mempolicy says "... if
>> required nodemask contains no nodes that are allowed by the process's current
>> cpuset context, the memory policy reverts to local allocation" which does come
>> down to ignoring mempolicy's nodemask.
> 
> I am talking about allocating outside of the currently allowed nodes
> (determined by mempolicy -- MPOL_BIND is the only concern as far as I can
> tell -- as well as the current cpuset). One can violate the cpuset if it's not
> a hardwall, but the MPOL_BIND node restriction cannot be violated.
> 
> Those allocations are also not allowed if the allocation was for a user
> space page even if this is a softwall cpuset.
> 
>>>> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
>>>> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
>>>> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
>>>> node_zonelist(). This works fine, because almost all callers of
>>>
>>> Well that would need to be subject to the hardwall flag. Allocation needs
>>> to fail for a hardwall cpuset.
>>
>> They still do, if no hardwall cpuset node can satisfy the allocation with
>> mempolicy ignored.
> 
> If the memory policy is MPOL_BIND then allocations outside of the given
> nodes should fail. They can violate the cpuset boundaries only if they are
> kernel allocations and we are not in a hardwall cpuset.
> 
> That was at least my understanding when working on this code years ago.

Hmm, I see policy_nodemask() (I wrongly mentioned node_zonelist()
before) has ignored the BIND mempolicy nodemask when it doesn't overlap with
the cpuset allowed nodes since the initial git commit 1da177e4c3f4 (back then
it was zonelist_policy()). But AFAIU this couldn't actually happen (outside
of races), because 1) one is not allowed to create such an effectively
empty BIND mempolicy in the first place, and 2) an existing mempolicy is
rebound on cpuset changes to maintain the overlap.
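
For reference, the check in question looks roughly like this in the current
mm/mempolicy.c (the policy's nodemask is simply not applied when it has no
overlap with the cpuset's mems_allowed):

/* Return a nodemask representing a mempolicy, for filtering the zonelist */
static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
	/* Lower zones don't get a nodemask applied for MPOL_BIND */
	if (unlikely(policy->mode == MPOL_BIND) &&
			apply_policy_zone(policy, gfp_zone(gfp)) &&
			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
		return &policy->v.nodes;

	return NULL;
}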

The point 2) does not apply to MPOL_F_STATIC_NODES mempolicies
introduced in 2008 by DavidR, but it's documented in
Documentation/vm/numa_memory_policy.txt and manpages that when they
don't overlap with cpuset allowed nodes, the default mempolicy is used
instead.

I doubt we can change that now, because that can break existing
programs. It also makes some sense at least to me, because a task can
control its own mempolicy (for performance reasons), but cpuset changes
are admin decisions that the task cannot even anticipate. I think it's
better to continue working with suboptimal performance than start
failing allocations?

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-13  6:24           ` Vlastimil Babka
@ 2017-04-14 20:37             ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-04-14 20:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu, 13 Apr 2017, Vlastimil Babka wrote:

>
> I doubt we can change that now, because that can break existing
> programs. It also makes some sense at least to me, because a task can
> control its own mempolicy (for performance reasons), but cpuset changes
> are admin decisions that the task cannot even anticipate. I think it's
> better to continue working with suboptimal performance than start
> failing allocations?

If the expected semantics (hardwall) are that allocations should fail, then
let's be consistent and do so.

Adding more and more exceptions gets this convoluted mess into an even
worse shape. Adding the static binding of nodes was already a screwball
if used within a cpuset because now one has to anticipate how a user would
move the nodes of a cpuset and how the static bindings would work in such
a context.

The admin basically needs to know how the application has used memory
policies if one still wants to move the applications within a cpuset with
the fixed bindings.

Maybe the best way to handle this is to give up on cpuset migration of
live applications? After all this can be done with a script in the same
way as the kernel is doing:

1. Extend the cpuset to include the new nodes.

2. Loop over the processes and use migrate_pages() to move the apps
one by one (a rough sketch follows below).

3. Remove the nodes no longer to be used.
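
A rough userspace sketch of step 2 (hypothetical; it assumes libnuma's
<numaif.h> wrapper for the migrate_pages(2) syscall, and that old_nodes and
new_nodes were built from the cpuset's previous and new mems):

#include <numa.h>
#include <numaif.h>

/* Move all of @pid's pages from @old_nodes to @new_nodes. */
static long move_task_pages(pid_t pid, struct bitmask *old_nodes,
			    struct bitmask *new_nodes)
{
	/* maxnode is the number of bits the kernel should examine */
	return migrate_pages(pid, old_nodes->size + 1,
			     old_nodes->maskp, new_nodes->maskp);
}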

Then forget about translating memory policies. If an application that is
supposed to run in a cpuset and supposed to be moveable has fixed bindings
then the application should be aware of that and be equipped with
some logic to rebind its memory on its own.

Such an application typically already has such logic and executes a
binding after discovering its numa node configuration on startup. It would
have to be modified to redo that action when it gets some sort of a signal
from the script telling it that the node config would be changed.

Having this logic in the application instead of the kernel avoids all the
kernel messes that we keep on trying to deal with and IMHO is much
cleaner.
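
A minimal sketch of such application-side logic (hypothetical; it assumes
libnuma and that the admin script sends SIGUSR1 after moving the cpuset):

#include <signal.h>
#include <numa.h>

static volatile sig_atomic_t rebind_requested;

static void on_sigusr1(int sig)
{
	rebind_requested = 1;
}

/* Called from the application's main loop at a safe point. */
static void maybe_rebind(void)
{
	struct bitmask *allowed;

	if (!rebind_requested)
		return;
	rebind_requested = 0;
	/* Re-read the allowed nodes and rebind future allocations. */
	allowed = numa_get_mems_allowed();
	numa_set_membind(allowed);
	numa_bitmask_free(allowed);
}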

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-14 20:37             ` Christoph Lameter
@ 2017-04-26  8:07               ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-04-26  8:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On 04/14/2017 10:37 PM, Christoph Lameter wrote:
> On Thu, 13 Apr 2017, Vlastimil Babka wrote:
> 
>>
>> I doubt we can change that now, because that can break existing
>> programs. It also makes some sense at least to me, because a task can
>> control its own mempolicy (for performance reasons), but cpuset changes
>> are admin decisions that the task cannot even anticipate. I think it's
>> better to continue working with suboptimal performance than start
>> failing allocations?
> 
> If the expected semantics (hardwall) are that allocations should fail then
> let's be consistent and do so.

It's not "expected" right now. The documented semantics is that (static,
as the others are rebound) mempolicy is ignored when it's not compatible
with cpuset. I'm just reusing the same existing semantic for race
situations. We can discuss whether we can change the semantics now, but
I don't think it should block this fix.
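
For context, "static" above means the MPOL_F_STATIC_NODES mode flag; a
minimal sketch of such a binding (the node number is made up):

#include <numaif.h>

static void bind_static_node1(void)
{
        unsigned long mask = 1UL << 1;  /* node 1 only */

        /*
         * With MPOL_F_STATIC_NODES the mask is never remapped when the
         * cpuset changes; if node 1 is later removed from the cpuset,
         * the policy has no usable node and is ignored, per the
         * documented semantics above, instead of failing allocations.
         */
        set_mempolicy(MPOL_BIND | MPOL_F_STATIC_NODES, &mask,
                      8 * sizeof(mask));
}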

> Adding more and more exceptions gets this convoluted mess into an even
> worse shape.

Again, it's not a new exception semantics-wise, but I agree that the
code of __alloc_pages_slowpath() is even more subtle. But I don't see
any other easy fix.

> Adding the static binding of nodes was already a screwball
> if used within a cpuset because now one has to anticipate how a user would
> move the nodes of a cpuset and how the static bindings would work in such
> a context.

On the other hand, static mempolicy is the only one that does not need
rebinding, and removing the other modes would allow a much simpler
implementation. I thought the outcome of the LSF/MM session was that we
should try to go that way.

> The admin basically needs to know how the application has used memory
> policies if one still wants to move an application with fixed bindings
> within a cpuset.
> 
> Maybe the best way to handle this is to give up on cpuset migration of
> live applications? After all, this can be done with a script in the same
> way the kernel does it:
> 
> 1. Extend the cpuset to include the new nodes.
> 
> 2. Loop over the processes and use migrate_pages() to move the apps
> one by one.
> 
> 3. Remove the nodes no longer to be used.
> 
> Then forget about translating memory policies. If an application that is
> supposed to run in a cpuset and supposed to be moveable has fixed bindings
> then the application should be aware of that and be equipped with
> some logic to rebind its memory on its own.
> 
> Such an application typically already has such logic and executes a
> binding after discovering its numa node configuration on startup. It would
> have to be modified to redo that action when it gets some sort of a signal
> from the script telling it that the node config would be changed.
> 
> Having this logic in the application instead of the kernel avoids all the
> kernel messes that we keep on trying to deal with and IMHO is much
> cleaner.

That would be much simpler for us indeed. But we still IMHO can't
abruptly start denying page fault allocations for existing applications
that don't have the necessary awareness.


^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-04-26  8:07               ` Vlastimil Babka
@ 2017-04-30 21:33                 ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-04-30 21:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Michal Hocko,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 26 Apr 2017, Vlastimil Babka wrote:

> > Such an application typically already has such logic and executes a
> > binding after discovering its numa node configuration on startup. It would
> > have to be modified to redo that action when it gets some sort of a signal
> > from the script telling it that the node config would be changed.
> >
> > Having this logic in the application instead of the kernel avoids all the
> > kernel messes that we keep on trying to deal with and IMHO is much
> > cleaner.
>
> That would be much simpler for us indeed. But we still IMHO can't
> abruptly start denying page fault allocations for existing applications
> that don't have the necessary awareness.

We certainly can do that. The failure of the page faults is due to the
admin trying to move an application that is not aware of this and is using
mempols. That could be an error. Trying to move an application that
contains both absolute and relative node numbers is definitely something
that is potentially so screwed up that the kernel should not muck around
with such an app.

Also user space can determine if the application is using memory policies
and can then take appropriate measures (message to the sysadmin to eval
the situation, f.e.) or mess around with the process's memory policies on
its own.
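
For example, a management tool could scan /proc/<pid>/numa_maps for any
non-default per-VMA policy; a crude sketch:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* 1 if some VMA of the task has a non-default policy, 0 if none,
 * -1 if the task could not be inspected */
static int task_uses_mempolicy(pid_t pid)
{
        char path[64], line[1024], addr[32], pol[32];
        int found = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/numa_maps", (int)pid);
        f = fopen(path, "r");
        if (!f)
                return -1;

        /* each line is "<addr> <policy> ...", where the policy looks
         * like "default", "bind:0-1", "interleave:0-3" or "prefer:2" */
        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "%31s %31s", addr, pol) == 2 &&
                    strcmp(pol, "default") != 0) {
                        found = 1;
                        break;
                }
        }
        fclose(f);
        return found;
}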

So this is certainly a way out of this mess.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-17  9:20                   ` Michal Hocko
  0 siblings, 0 replies; 91+ messages in thread
From: Michal Hocko @ 2017-05-17  9:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Sun 30-04-17 16:33:10, Christoph Lameter wrote:
> On Wed, 26 Apr 2017, Vlastimil Babka wrote:
> 
> > > Such an application typically already has such logic and executes a
> > > binding after discovering its numa node configuration on startup. It would
> > > have to be modified to redo that action when it gets some sort of a signal
> > > from the script telling it that the node config would be changed.
> > >
> > > Having this logic in the application instead of the kernel avoids all the
> > > kernel messes that we keep on trying to deal with and IMHO is much
> > > cleaner.
> >
> > That would be much simpler for us indeed. But we still IMHO can't
> > abruptly start denying page fault allocations for existing applications
> > that don't have the necessary awareness.
> 
> We certainly can do that. The failure of the page faults is due to the
> admin trying to move an application that is not aware of this and is using
> mempols. That could be an error. Trying to move an application that
> contains both absolute and relative node numbers is definitely something
> that is potentially so screwed up that the kernel should not muck around
> with such an app.
> 
> Also user space can determine if the application is using memory policies
> and can then take appropriate measures (message to the sysadmin to eval
> the situation, f.e.) or mess around with the process's memory policies on
> its own.
> 
> So this is certainly a way out of this mess.

So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
case in a raceless way?
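
Schematically (a simplified sketch of an anonymous fault, loosely after
do_anonymous_page(); not the actual code):

static int fault_sketch(struct vm_area_struct *vma, unsigned long address)
{
        struct page *page;

        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
        if (!page)
                return VM_FAULT_OOM;    /* the same result whether we are
                                         * truly out of memory or the
                                         * mempolicy nodemask transiently
                                         * became empty during a cpuset
                                         * update */
        /* ... map the page ... */
        return 0;
}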

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-17 13:56                     ` Christoph Lameter
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-05-17 13:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 17 May 2017, Michal Hocko wrote:

> > We certainly can do that. The failure of the page faults is due to the
> > admin trying to move an application that is not aware of this and is using
> > mempols. That could be an error. Trying to move an application that
> > contains both absolute and relative node numbers is definitely something
> > that is potentially so screwed up that the kernel should not muck around
> > with such an app.
> >
> > Also user space can determine if the application is using memory policies
> > and can then take appropriate measures (message to the sysadmin to eval
> > the situation, f.e.) or mess around with the process's memory policies on
> > its own.
> >
> > So this is certainly a way out of this mess.
>
> So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> case in a raceless way?

You don't have to do that if you do not create an empty mempolicy in the
first place. The current kernel code avoids that by first allowing access
to the new set of nodes and removing the old ones from the set when done.
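
A simplified sketch of that two-step update (not the exact mempolicy
code); the intent is that a parallel allocation never observes an empty
mask:

#include <linux/nodemask.h>

static void rebind_two_step(nodemask_t *policy_nodes,
                            const nodemask_t *new_nodes)
{
        nodemask_t tmp;

        /* step 1: widen the mask to old | new */
        nodes_or(tmp, *policy_nodes, *new_nodes);
        *policy_nodes = tmp;

        /* a parallel allocation here sees the union */

        /* step 2: narrow the mask to just the new nodes */
        *policy_nodes = *new_nodes;
}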

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-17 14:05                       ` Michal Hocko
  0 siblings, 0 replies; 91+ messages in thread
From: Michal Hocko @ 2017-05-17 14:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed 17-05-17 08:56:34, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
> 
> > > We certainly can do that. The failure of the page faults is due to the
> > > admin trying to move an application that is not aware of this and is using
> > > mempols. That could be an error. Trying to move an application that
> > > contains both absolute and relative node numbers is definitely something
> > > that is potentially so screwed up that the kernel should not muck around
> > > with such an app.
> > >
> > > Also user space can determine if the application is using memory policies
> > > and can then take appropriate measures (message to the sysadmin to eval
> > > the situation, f.e.) or mess around with the process's memory policies on
> > > its own.
> > >
> > > So this is certainly a way out of this mess.
> >
> > So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> > case in a raceless way?
> 
> You don't have to do that if you do not create an empty mempolicy in the
> first place. The current kernel code avoids that by first allowing access
> to the new set of nodes and removing the old ones from the set when done.

which is racy, as Vlastimil pointed out. If we simply fail such an
allocation, the failure will go up the call chain until we hit the OOM
killer due to VM_FAULT_OOM. How would you want to handle that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17 14:05                       ` Michal Hocko
@ 2017-05-17 14:48                         ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-05-17 14:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 17 May 2017, Michal Hocko wrote:

> > > So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> > > case in a raceless way?
> >
> > You don't have to do that if you do not create an empty mempolicy in the
> > first place. The current kernel code avoids that by first allowing access
> > to the new set of nodes and removing the old ones from the set when done.
>
> which is racy, as Vlastimil pointed out. If we simply fail such an
> allocation, the failure will go up the call chain until we hit the OOM
> killer due to VM_FAULT_OOM. How would you want to handle that?

The race is where? If you expand the node set during the move of the
application then you are safe in terms of the legacy apps that did not
include static bindings.

If you have screwy things like static mbinds in there then you are
hopelessly lost anyways. You may have moved the process to another set
of nodes but the static bindings may refer to a node no longer
available. Thus the OOM is legitimate.

At least a user space app could inspect
the situation and come up with custom ways of dealing with the mess.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-17 14:56                           ` Michal Hocko
  0 siblings, 0 replies; 91+ messages in thread
From: Michal Hocko @ 2017-05-17 14:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed 17-05-17 09:48:25, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
> 
> > > > So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
> > > > case in a raceless way?
> > >
> > > You don't have to do that if you do not create an empty mempolicy in the
> > > first place. The current kernel code avoids that by first allowing access
> > > to the new set of nodes and removing the old ones from the set when done.
> >
> > which is racy, as Vlastimil pointed out. If we simply fail such an
> > allocation, the failure will go up the call chain until we hit the OOM
> > killer due to VM_FAULT_OOM. How would you want to handle that?
> 
> The race is where? If you expand the node set during the move of the
> application then you are safe in terms of the legacy apps that did not
> include static bindings.

I am pretty sure it is described in those changelogs and I won't repeat
it here.

> If you have screwy things like static mbinds in there then you are
> hopelessly lost anyways. You may have moved the process to another set
> of nodes but the static bindings may refer to a node no longer
> available. Thus the OOM is legitimate.

The point is that you do _not_ want such a process to trigger the OOM
because it can cause other processes to be killed.

> At least a user space app could inspect
> the situation and come up with custom ways of dealing with the mess.

I do not really see how this would help to prevent a malicious user from
playing tricks.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17 14:56                           ` Michal Hocko
@ 2017-05-17 15:25                             ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-05-17 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 17 May 2017, Michal Hocko wrote:

> > If you have screwy things like static mbinds in there then you are
> > hopelessly lost anyways. You may have moved the process to another set
> > of nodes but the static bindings may refer to a node no longer
> > available. Thus the OOM is legitimate.
>
> The point is that you do _not_ want such a process to trigger the OOM
> because it can cause other processes to be killed.

Nope. The OOM in a cpuset gets the process doing the alloc killed. Or has
that changed?

At this point you have messed up royally and nothing is going to rescue
you anyways. OOM or not does not matter anymore. The app will fail.

> > At least a user space app could inspect
> > the situation and come up with custom ways of dealing with the mess.
>
> I do not really see how this would help to prevent a malicious user from
> playing tricks.

How did a malicious user come into this? Of course you can mess up in
significant ways if you can overflow nodes and cause an app that has
restrictions to fail, but nothing is going to change that.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17 14:56                           ` Michal Hocko
@ 2017-05-17 15:27                             ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-05-17 15:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed, 17 May 2017, Michal Hocko wrote:

> > The race is where? If you expand the node set during the move of the
> > application then you are safe in terms of the legacy apps that did not
> > include static bindings.
>
> I am pretty sure it is described in those changelogs and I won't repeat
> it here.

I cannot figure out what you are referring to. There are numerous
patches and discussions about OOM scenarios in this context.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-18  9:08                               ` Michal Hocko
  0 siblings, 0 replies; 91+ messages in thread
From: Michal Hocko @ 2017-05-18  9:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Wed 17-05-17 10:25:09, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
> 
> > > If you have screwy things like static mbinds in there then you are
> > > hopelessly lost anyways. You may have moved the process to another set
> > > of nodes but the static bindings may refer to a node no longer
> > > available. Thus the OOM is legitimate.
> >
> > The point is that you do _not_ want such a process to trigger the OOM
> > because it can cause other processes to be killed.
> 
> > Nope. The OOM in a cpuset gets the process doing the alloc killed. Or has
> > that changed?
> 
> At this point you have messed up royally and nothing is going to rescue
> you anyways. OOM or not does not matter anymore. The app will fail.

Not really. If you can trick the system into _thinking_ that the intersection
between the mempolicy and the cpuset is empty, then the OOM killer might
kill an innocent task rather than the one which tricked it into that
situation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17 14:48                         ` Christoph Lameter
@ 2017-05-18 10:03                           ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-05-18 10:03 UTC (permalink / raw)
  To: Christoph Lameter, Michal Hocko
  Cc: linux-mm, linux-kernel, cgroups, Li Zefan, Mel Gorman,
	David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On 05/17/2017 04:48 PM, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
> 
>>>> So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
>>>> case in a raceless way?
>>>
>>> You don't have to do that if you do not create an empty mempolicy in the
>>> first place. The current kernel code avoids that by first allowing access
>>> to the new set of nodes and removing the old ones from the set when done.
>>
>> which is racy, as Vlastimil pointed out. If we simply fail such an
>> allocation, the failure will go up the call chain until we hit the OOM
>> killer due to VM_FAULT_OOM. How would you want to handle that?
> 
> The race is where? If you expand the node set during the move of the
> application then you are safe in terms of the legacy apps that did not
> include static bindings.

No, that expand/shrink by itself doesn't work against parallel
get_page_from_freelist going through a zonelist. Moving from node 0 to
1, with zonelist containing nodes 1 and 0 in that order:

- mempolicy mask is 0
- zonelist iteration checks node 1, it's not allowed, skip
- mempolicy mask is 0,1 (expand)
- mempolicy mask is 1 (shrink)
- zonelist iteration checks node 0, it's not allowed, skip
- OOM
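
For reference, the read side in the allocator that is supposed to catch
such an update looks roughly like this (a simplified sketch of the
existing cpuset_mems_cookie retry in the allocation slowpath); it can
only help if every update path bumps the sequence counter:

        unsigned int cookie;
        struct page *page;

retry_cpuset:
        cookie = read_mems_allowed_begin();
        page = get_page_from_freelist(gfp_mask, order, alloc_flags, &ac);
        if (!page && read_mems_allowed_retry(cookie))
                goto retry_cpuset;      /* mask changed underneath us */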

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-18 16:57                                 ` Christoph Lameter
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-05-18 16:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu, 18 May 2017, Michal Hocko wrote:

> > Nope. The OOM in a cpuset gets the process doing the alloc killed. Or has
> > that changed?

!!!!!

> >
> > At this point you have messed up royally and nothing is going to rescue
> > you anyways. OOM or not does not matter anymore. The app will fail.
>
> Not really. If you can trick the system into _thinking_ that the intersection
> between the mempolicy and the cpuset is empty, then the OOM killer might
> kill an innocent task rather than the one which tricked it into that
> situation.

See above. An OOM kill in a cpuset does not kill an innocent task but a task
that does an allocation in that specific context, meaning a task in that
cpuset that also has a memory policy.

Regardless of that, the point earlier was that the moving logic can avoid
creating temporary situations of empty sets of nodes by analysing the
memory policies etc and only performing moves when doing so is safe.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-18 10:03                           ` Vlastimil Babka
@ 2017-05-18 17:07                             ` Christoph Lameter
  -1 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-05-18 17:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu, 18 May 2017, Vlastimil Babka wrote:

> > The race is where? If you expand the node set during the move of the
> > application then you are safe in terms of the legacy apps that did not
> > include static bindings.
>
> No, that expand/shrink by itself doesn't work against parallel

Parallel? I think we are clear that this is inherently racy against the
app changing policies etc.? There is a huge issue there already. The
app needs to be well behaved in some heretofore undefined way in order to
make moves clean.

> get_page_from_freelist going through a zonelist. Moving from node 0 to
> 1, with zonelist containing nodes 1 and 0 in that order:
>
> - mempolicy mask is 0
> - zonelist iteration checks node 1, it's not allowed, skip

There is an allocation from node 1? That is not allowed before the move,
so it should fail, not skip to another node.

> - mempolicy mask is 0,1 (expand)
> - mempolicy mask is 1 (shrink)
> - zonelist iteration checks node 0, it's not allowed, skip
> - OOM

Are you talking about a race here between zonelist scanning and the
moving? That has been there forever.

And frankly there are gazillions of these races. The best thing to do is
to get the cpuset moving logic out of the kernel and into user space.

Understand that this is a heuristic and maybe come up with a list of
restrictions that make an app safe. A safe app that can be moved must, e.g.:

1. Not allocate new memory while it's being moved.
2. Not change memory policies after its initialization and while it's being
moved.
3. Not save memory policy state in some variable (because the logic to
translate the memory policies for the new context cannot find it).

...

Again, cpuset process migration is a huge mess that you do not want to
have in the kernel, and AFAICT this is a corner case with difficult
semantics. Better to have that in user space...

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-18 17:24                                   ` Michal Hocko
  0 siblings, 0 replies; 91+ messages in thread
From: Michal Hocko @ 2017-05-18 17:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu 18-05-17 11:57:55, Christoph Lameter wrote:
> On Thu, 18 May 2017, Michal Hocko wrote:
> 
> > > Nope. The OOM in a cpuset gets the process doing the alloc killed. Or has
> > > that changed?
> 
> !!!!!
> 
> > >
> > > At this point you have messed up royally and nothing is going to rescue
> > > you anyways. OOM or not does not matter anymore. The app will fail.
> >
> > Not really. If you can trick the system into _thinking_ that the intersection
> > between mempolicy and the cpuset is empty then the OOM killer might
> > trigger an innocent task rather than the one which tricked it into that
> > situation.
> 
> See above. An OOM kill in a cpuset does not kill an innocent task but a task
> that does an allocation in that specific context, meaning a task in that
> cpuset that also has a memory policy.

No, the oom killer will choose the largest task in the specific NUMA
domain. If you just fail such an allocation then a page fault would get
VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
the cpusets.
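
For reference, that path looks roughly like this (abbreviated from the
4.11-era mm/oom_kill.c):

void pagefault_out_of_memory(void)
{
	struct oom_control oc = {
		.zonelist = NULL,	/* no allocation context here, so    */
		.nodemask = NULL,	/* constrained_alloc() cannot see a  */
		.memcg = NULL,		/* cpuset/mempolicy constraint       */
		.gfp_mask = 0,
		.order = 0,
	};

	if (mem_cgroup_oom_synchronize(true))
		return;

	if (!mutex_trylock(&oom_lock))
		return;
	out_of_memory(&oc);	/* global victim selection */
	mutex_unlock(&oom_lock);
}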
 
> Regardless of that, the point earlier was that the moving logic can avoid
> creating temporary situations of empty node sets by analysing the
> memory policies etc. and only performing moves when doing so is safe.

How are you going to do that in a raceless way? Moreover the whole
discussion is about _failing_ allocations on an empty cpuset and
mempolicy intersection.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
@ 2017-05-18 19:07                                     ` Christoph Lameter
  0 siblings, 0 replies; 91+ messages in thread
From: Christoph Lameter @ 2017-05-18 19:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu, 18 May 2017, Michal Hocko wrote:

> > See above. An OOM kill in a cpuset does not kill an innocent task but a task
> > that does an allocation in that specific context, meaning a task in that
> > cpuset that also has a memory policy.
>
> No, the oom killer will choose the largest task in the specific NUMA
> domain. If you just fail such an allocation then a page fault would get
> VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
> the cpusets.

Ok, someone screwed up that code. There still is the determination that we
have a constrained alloc:

oom_kill:
	/*
	 * Check if there were limitations on the allocation (only relevant for
	 * NUMA and memcg) that may require different handling.
	 */
	constraint = constrained_alloc(oc);
	if (constraint != CONSTRAINT_MEMORY_POLICY)
		oc->nodemask = NULL;
	check_panic_on_oom(oc, constraint);

-- Ok. A constrained failing alloc used to terminate the allocating
	process here. But it falls through to selecting a "bad process"


        if (!is_memcg_oom(oc) && sysctl_oom_kill_allocating_task &&
            current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
            current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
                get_task_struct(current);
                oc->chosen = current;
                oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
                return true;
        }

--  A constrained allocation should not get here but fail the process that
	attempts the alloc.

        select_bad_process(oc);


Can we restore the old behavior? If I just specify the right memory policy
I can cause other processes to just be terminated?


> > Regardless of that, the point earlier was that the moving logic can avoid
> > creating temporary situations of empty node sets by analysing the
> > memory policies etc. and only performing moves when doing so is safe.
>
> How are you going to do that in a raceless way? Moreover the whole
> discussion is about _failing_ allocations on an empty cpuset and
> mempolicy intersection.

Again, this only works for processes that are well behaved, and it
never worked in a different way before. There was always the assumption
that a process does not allocate in the areas that have allocation
constraints and that the process does not change memory policies nor
store them somewhere for later, etc. HPC apps typically allocate memory
on startup and then go through long periods of processing and I/O.
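
A minimal userspace sketch of that pattern (a hypothetical toy example;
the node number and sizes are made up):

#include <numaif.h>	/* set_mempolicy(), MPOL_BIND; link with -lnuma */
#include <stdlib.h>

static void compute(char *buf, size_t len)
{
	/* stand-in for the long processing/IO phase */
	for (size_t i = 0; i < len; i += 4096)
		buf[i] = 1;
}

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* bind to node 0; example value */

	/* one-time static binding during initialization */
	if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
		return 1;

	/* all the big allocations happen up front... */
	size_t len = 1UL << 28;
	char *buf = malloc(len);
	if (!buf)
		return 1;

	/* ...and no further policy changes while the job runs, so a cpuset
	 * move only has to deal with the static policy, not with the app
	 * rewriting it concurrently */
	compute(buf, len);
	return 0;
}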

The idea that cpuset node-to-node migration will work with a running
process that does arbitrary activity is a pipe dream that we should give
up. There must be constraints on a process in order to allow this to work
and as far as I can tell this is best done in userspace with a library and
by putting requirements on the applications that desire to be movable that
way.

E.g., an application that does not use memory policies or other allocation
constraints should be fine. That has been working.

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-18 19:07                                     ` Christoph Lameter
@ 2017-05-19  7:37                                       ` Michal Hocko
  -1 siblings, 0 replies; 91+ messages in thread
From: Michal Hocko @ 2017-05-19  7:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On Thu 18-05-17 14:07:45, Christoph Lameter wrote:
> On Thu, 18 May 2017, Michal Hocko wrote:
> 
> > > See above. An OOM kill in a cpuset does not kill an innocent task but a task
> > > that does an allocation in that specific context, meaning a task in that
> > > cpuset that also has a memory policy.
> >
> > No, the oom killer will choose the largest task in the specific NUMA
> > domain. If you just fail such an allocation then a page fault would get
> > VM_FAULT_OOM and pagefault_out_of_memory would kill a task regardless of
> > the cpusets.
> 
> Ok someone screwed up that code. There still is the determination that we
> have a constrained alloc:

It would be much easier if you read emails more carefully. In order
to have a constrained OOM you have to have either a non-null nodemask or
zonelist. And as I've said above, you do not have them in the
pagefault_out_of_memory context. The whole point of this discussion is
_that_ failing allocations will not work currently!

> oom_kill:
> 	/*
> 	 * Check if there were limitations on the allocation (only relevant for
> 	 * NUMA and memcg) that may require different handling.
> 	 */
> 	constraint = constrained_alloc(oc);
> 	if (constraint != CONSTRAINT_MEMORY_POLICY)
> 		oc->nodemask = NULL;
> 	check_panic_on_oom(oc, constraint);
> 
> -- Ok. A constrained failing alloc used to terminate the allocating
> 	process here. But it falls through to selecting a "bad process"

This behavior has been there for ~10 years.
[...]
> Can we restore the old behavior? If I just specify the right memory policy
> I can cause other processes to just be terminated?

Not normally, because out_of_memory() called from the page allocator
context makes sure to kill tasks from the same NUMA domain (see
oom_unkillable_task()).
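
For completeness, the check in question (abbreviated from the 4.11-era
mm/oom_kill.c):

static bool oom_unkillable_task(struct task_struct *p,
		struct mem_cgroup *memcg, const nodemask_t *nodemask)
{
	if (is_global_init(p))
		return true;
	if (p->flags & PF_KTHREAD)
		return true;

	/* When mem_cgroup_out_of_memory() and p is not member of the group */
	if (memcg && !task_in_mem_cgroup(p, memcg))
		return true;

	/* p may not have freeable memory in nodemask */
	if (!has_intersects_mems_allowed(p, nodemask))
		return true;

	return false;
}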
 
> > > Regardless of that, the point earlier was that the moving logic can avoid
> > > creating temporary situations of empty node sets by analysing the
> > > memory policies etc. and only performing moves when doing so is safe.
> >
> > How are you going to do that in a raceless way? Moreover the whole
> > discussion is about _failing_ allocations on an empty cpuset and
> > mempolicy intersection.
> 
> Again, this only works for processes that are well behaved, and it
> never worked in a different way before. There was always the assumption
> that a process does not allocate in the areas that have allocation
> constraints and that the process does not change memory policies nor
> store them somewhere for later, etc. HPC apps typically allocate memory
> on startup and then go through long periods of processing and I/O.

I would call it a bad design which then triggered a lot of work over the
years to make it semi-working. This is what Vlastimil tries to address now.
And yes, that might mean we would have to put some restrictions on the
semantics. But as you know, this is a user-visible API, and changing
something that was fundamentally underdefined initially is quite hard
to fix.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 91+ messages in thread

* Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-18 17:07                             ` Christoph Lameter
@ 2017-05-19 11:27                               ` Vlastimil Babka
  -1 siblings, 0 replies; 91+ messages in thread
From: Vlastimil Babka @ 2017-05-19 11:27 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Michal Hocko, linux-mm, linux-kernel, cgroups, Li Zefan,
	Mel Gorman, David Rientjes, Hugh Dickins, Andrea Arcangeli,
	Anshuman Khandual, Kirill A. Shutemov, linux-api

On 05/18/2017 07:07 PM, Christoph Lameter wrote:
> On Thu, 18 May 2017, Vlastimil Babka wrote:
> 
>>> The race is where? If you expand the node set during the move of the
>>> application then you are safe in terms of the legacy apps that did not
>>> include static bindings.
>>
>> No, that expand/shrink by itself doesn't work against parallel
> 
> Parallel? I think we are clear that ithis is inherently racy against the
> app changing policies etc etc? There is a huge issue there already. The
> app needs to be well behaved in some heretofore undefined way in order to
> make moves clean.

The code is safe against mbind() changing a vma's mempolicy in parallel
with another thread page faulting within that vma, because mbind() takes
mmap_sem for write, and page faults take it for read. The per-task
mempolicy can only be changed by the task itself via set_mempolicy(),
which means the task doesn't allocate anything in parallel with the change.
So, the application never needed to be "well behaved" wrt changing its
own mempolicies.
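
A condensed sketch of that locking (4.11-era names, details omitted):

	/* mbind() path, mm/mempolicy.c: the vma policy is swapped under
	 * the exclusive mmap_sem */
	down_write(&mm->mmap_sem);
	err = mbind_range(mm, start, end, new);	/* install the new policy */
	up_write(&mm->mmap_sem);

	/* page fault path: readers see either the old or the new policy,
	 * never a torn update */
	down_read(&mm->mmap_sem);
	/* handle_mm_fault() -> alloc_pages_vma() reads vma->vm_policy */
	up_read(&mm->mmap_sem);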

Now with mempolicy rebinding due to cpuset migrations, the application
cannot be "well behaved" as it has no way to learn about being under a
cpuset, or cpuset change. Any application can be put in a cpuset and we
can't really expect that all would be adapted, even if the necessary
interfaces existed. Thus, the rebinding implementation in the kernel
itself has to be robust against parallel allocations.

>> get_page_from_freelist going through a zonelist. Moving from node 0 to
>> 1, with zonelist containing nodes 1 and 0 in that order:
>>
>> - mempolicy mask is 0
>> - zonelist iteration checks node 1, it's not allowed, skip
> 
> There is an allocation from node 1?

Sorry, I neglected to mention the full scenario. Let's say the allocation
happens on a CPU local to node 1, so it uses node 1's zonelist, which
contains nodes 1 and 0 in that order.

> That is not allowed before the move,
> so it should fail, not skip to another node.
> 
>> - mempolicy mask is 0,1 (expand)
>> - mempolicy mask is 1 (shrink)
>> - zonelist iteration checks node 0, it's not allowed, skip
>> - OOM
> 
> Are you talking about a race here between zonelist scanning and the
> moving? That has been there forever.

As far as I can tell from my git archeology in [1], there was always some
kind of protection against the race (generation counters, two-step
protocol, seqlock...), which, however, had some corner cases. This patch
is merely plugging the last known one.

> And frankly there are gazillions of these races.

I don't know about any other existing race that we don't handle after
this patch.

> The best thing to do is
> to get the cpuset moving logic out of the kernel and into user space.
> 
> Understand that this is a heuristic and maybe come up with a list of
> restrictions that make an app safe. A safe app that can be moved must, e.g.:
> 
> 1. Not allocate new memory while it's being moved.
> 2. Not change memory policies after its initialization and while it's being
> moved.

As I explained earlier in this mail, changing a mempolicy by the app itself
is safe; the problem was always the cpuset-triggered rebinding.

> 3. Not save memory policy state in some variable (because the logic to
> translate the memory policies for the new context cannot find it).
> 
> ...
> 
> Again, cpuset process migration is a huge mess that you do not want to
> have in the kernel, and AFAICT this is a corner case with difficult
> semantics. Better to have that in user space...

Moving this out of the kernel etc. would change the current semantics and
break existing userspace; this patch is a fix within the existing semantics.

[1] https://marc.info/?l=linux-mm&m=148611344511408&w=2



^ permalink raw reply	[flat|nested] 91+ messages in thread

end of thread

Thread overview: 91+ messages
2017-04-11 14:06 [RFC 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
2017-04-11 14:06 ` [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
2017-04-11 17:24   ` Christoph Lameter
2017-04-11 19:00     ` Vlastimil Babka
2017-04-12 21:25       ` Christoph Lameter
2017-04-13  6:24         ` Vlastimil Babka
2017-04-14 20:37           ` Christoph Lameter
2017-04-26  8:07             ` Vlastimil Babka
2017-04-30 21:33               ` Christoph Lameter
2017-05-17  9:20                 ` Michal Hocko
2017-05-17 13:56                   ` Christoph Lameter
2017-05-17 14:05                     ` Michal Hocko
2017-05-17 14:48                       ` Christoph Lameter
2017-05-17 14:56                         ` Michal Hocko
2017-05-17 15:25                           ` Christoph Lameter
2017-05-18  9:08                             ` Michal Hocko
2017-05-18 16:57                               ` Christoph Lameter
2017-05-18 17:24                                 ` Michal Hocko
2017-05-18 19:07                                   ` Christoph Lameter
2017-05-19  7:37                                     ` Michal Hocko
2017-05-17 15:27                           ` Christoph Lameter
2017-05-18 10:03                         ` Vlastimil Babka
2017-05-18 17:07                           ` Christoph Lameter
2017-05-19 11:27                             ` Vlastimil Babka
2017-04-13  5:42   ` Anshuman Khandual
2017-04-13  6:06     ` Vlastimil Babka
2017-04-13  6:07       ` Vlastimil Babka
2017-04-11 14:06 ` [RFC 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() Vlastimil Babka
2017-04-11 17:32   ` Christoph Lameter
2017-04-11 19:03     ` Vlastimil Babka
2017-04-12  8:49       ` Vlastimil Babka
2017-04-12 21:16         ` Christoph Lameter
2017-04-12 21:18           ` Vlastimil Babka
2017-04-11 14:06 ` [RFC 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
2017-04-11 14:06 ` [RFC 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets Vlastimil Babka
2017-04-11 14:06 ` [RFC 5/6] mm, cpuset: always use seqlock when changing task's nodemask Vlastimil Babka
2017-04-12  8:10   ` Hillf Danton
2017-04-12  8:18     ` Vlastimil Babka
2017-04-11 14:06 ` [RFC 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter Vlastimil Babka
