* [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups
@ 2017-05-17  8:11 Vlastimil Babka
  2017-05-17  8:11 ` [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-17  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, linux-kernel, cgroups, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov, Vlastimil Babka

Changes since RFC v1 [3]:

- Reworked patch 2 after discussion with Christoph Lameter.
- Fix bug in patch 5 spotted by Hillf Danton.
- Rebased to mmotm-2017-05-12-15-53

I would like to stress that this patchset aims to fix issues and cleanup the
code *within the existing documented semantics*, i.e. patch 1 ignores mempolicy
restrictions if the set of allowed nodes has no intersection with set of nodes
allowed by cpuset. I believe discussing potential changes of the semantics can
be better done once we have a baseline with no known bugs of the current
semantics.

===

I've recently summarized the cpuset/mempolicy issues in a LSF/MM proposal [1]
and the discussion itself [2]. I've been trying to rewrite the handling as
proposed, with the idea that changing semantics to make all mempolicies static
wrt cpuset updates (and discarding the relative and default modes) can be tried
on top, as there's a high risk of being rejected/reverted because somebody
might still care about the removed modes.

However I haven't yet figured out how to properly:

1) make mempolicies swappable instead of rebinding them in place. I thought mbind()
already works that way and uses refcounting to avoid use-after-free of the old
policy by a parallel allocation, but it turns out true refcounting is only done
for shared (shmem) mempolicies, and the actual protection for mbind() comes
from mmap_sem (see the toy model after point 2 below). Extending the refcounting
means more overhead in the allocator hot path. Also, swapping whole mempolicies
means that we have to allocate the new ones, which can fail, and reverting the
partially done work also means allocating (note that mbind() doesn't care and
will just leave part of the range updated and part not updated when returning
-ENOMEM...).

2) make cpuset's task->mems_allowed also swappable (after converting it from
nodemask to zonelist, which is the easy part) for mostly the same reasons.
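
To illustrate what making mempolicies "swappable with refcounting" would mean
(the toy model referred to in point 1), here is a purely conceptual,
single-threaded userspace sketch. It is not derived from any kernel code; a
real implementation would need atomic or per-cpu counting in the allocator hot
path, which is exactly the overhead concern above, plus handling of the
allocation failures mentioned:

#include <stdio.h>
#include <stdlib.h>

/* Toy model only: a "policy" is a refcounted object that updaters replace
 * wholesale instead of rewriting its nodemask in place. */
struct policy {
	int refcnt;		/* would have to be atomic (or per-cpu) for real */
	unsigned int nodes;	/* bitmask of allowed nodes */
};

static struct policy *current_policy;

static struct policy *policy_get(void)
{
	current_policy->refcnt++;
	return current_policy;
}

static void policy_put(struct policy *p)
{
	if (--p->refcnt == 0)
		free(p);
}

/* The "cpuset update": publish a new object, drop the reference to the old
 * one.  Note the new object must be allocated, which can fail. */
static int policy_swap(unsigned int new_nodes)
{
	struct policy *new = malloc(sizeof(*new));
	struct policy *old = current_policy;

	if (!new)
		return -1;
	new->refcnt = 1;
	new->nodes = new_nodes;
	current_policy = new;
	policy_put(old);
	return 0;
}

int main(void)
{
	current_policy = malloc(sizeof(*current_policy));
	current_policy->refcnt = 1;
	current_policy->nodes = 0x3;		/* nodes {0,1} */

	struct policy *seen = policy_get();	/* an "allocation" pins the policy */
	policy_swap(0xc);			/* concurrent "cpuset update" to {2,3} */

	/* The pinned snapshot stays valid and internally consistent. */
	printf("allocation still sees nodes 0x%x\n", seen->nodes);
	policy_put(seen);

	policy_put(current_policy);
	return 0;
}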

The good news is that while trying to do the above, I've at least figured out
how to hopefully close the remaining premature OOMs, and do a bunch of cleanups
on top, removing quite some of the code that was also supposed to prevent the
cpuset update races, but no longer works. This should fix the most pressing
concerns with this topic and give us a better baseline before either proceeding
with the original proposal, or pushing a change of semantics that removes
problem 1) above. I'd then be fine with trying to change the semantics first and
rewrite later.

Patchset is based on next-20170411 and has been tested with the LTP cpuset01
stress test.

[1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
[2] https://lwn.net/Articles/717797/
[3] https://marc.info/?l=linux-mm&m=149191957922828&w=2

Vlastimil Babka (6):
  mm, page_alloc: fix more premature OOM due to race with cpuset update
  mm, mempolicy: stop adjusting current->il_next in
    mpol_rebind_nodemask()
  mm, page_alloc: pass preferred nid instead of zonelist to allocator
  mm, mempolicy: simplify rebinding mempolicies when updating cpusets
  mm, cpuset: always use seqlock when changing task's nodemask
  mm, mempolicy: don't check cpuset seqlock where it doesn't matter

 include/linux/gfp.h            |  11 ++-
 include/linux/mempolicy.h      |  12 ++-
 include/linux/sched.h          |   2 +-
 include/uapi/linux/mempolicy.h |   8 --
 kernel/cgroup/cpuset.c         |  33 ++------
 mm/hugetlb.c                   |  15 ++--
 mm/memory_hotplug.c            |   6 +-
 mm/mempolicy.c                 | 181 ++++++++++-------------------------------
 mm/page_alloc.c                |  61 ++++++++++----
 9 files changed, 118 insertions(+), 211 deletions(-)

-- 
2.12.2

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
@ 2017-05-17  8:11 ` Vlastimil Babka
  2017-05-19 11:51   ` Michal Hocko
  2017-05-17  8:11 ` [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() Vlastimil Babka
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-17  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, linux-kernel, cgroups, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov, Vlastimil Babka

Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset
mems update") has fixed known recent regressions found by LTP's cpuset01
testcase. However, I have found that by modifying the testcase to use per-vma
mempolicies via mbind(2) instead of per-task mempolicies via set_mempolicy(2),
the premature OOM still happens, and the issue is much older.

The root of the problem is that the cpuset's mems_allowed and mempolicy's
nodemask can temporarily have no intersection, thus get_page_from_freelist()
cannot find any usable zone. The current semantic for empty intersection is to
ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
policy_nodemask(), but the racy update can happen after we have already passed the
check. Such races should be protected by the seqlock task->mems_allowed_seq,
but it doesn't work here, because 1) mpol_rebind_mm() does not happen under
seqlock for write, and doing so would lead to deadlock, as it takes mmap_sem
for write, while the allocation can have mmap_sem for read when it's taking the
seqlock for read. And 2) the seqlock cookie of callers of node_zonelist()
(alloc_pages_vma() and alloc_pages_current()) is different from the one taken in
__alloc_pages_slowpath(), so there's still a potential race window.

This patch fixes the issue by having __alloc_pages_slowpath() check for empty
intersection of cpuset and ac->nodemask before OOM or allocation failure. If
it's indeed empty, the nodemask is ignored and the allocation retried, which
mimics policy_nodemask(). This works fine, because almost all callers of
__alloc_pages_nodemask obtain their nodemask from an MPOL_BIND mempolicy. The
only exception is new_node_page() from hotplug, where a potential violation of
the nodemask isn't an issue, as there's already a fallback allocation attempt
without any nodemask. If there's a future caller that needs to have its specific
nodemask honoured over task's cpuset restrictions, we'll have to e.g. add a gfp
flag for that.
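
To make the race window above concrete, here is a minimal userspace model
(an illustration only, not kernel code): the cpuset's mems_allowed and the
MPOL_BIND nodemask are updated as two separate steps, so a task allocating in
between can compute an empty intersection and, before this patch, head into a
premature OOM instead of falling back to the cpuset-only semantics:

#include <stdio.h>

static unsigned int mems_allowed = 0x3;		/* cpuset mems: nodes {0,1} */
static unsigned int mpol_bind_nodes = 0x1;	/* MPOL_BIND to node {0}   */

static void show(const char *when)
{
	printf("%-26s mems=0x%x bind=0x%x intersection=0x%x\n",
	       when, mems_allowed, mpol_bind_nodes,
	       mems_allowed & mpol_bind_nodes);
}

int main(void)
{
	show("before cpuset update:");

	mems_allowed = 0xc;			/* step 1: cpuset moves to nodes {2,3} */
	show("mid-update (racy window):");	/* empty intersection here */

	mpol_bind_nodes = 0x4;			/* step 2: mempolicy rebound to node {2} */
	show("after mempolicy rebind:");
	return 0;
}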

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beb2827fd5de..43aa767c3188 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3661,6 +3661,39 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	return false;
 }
 
+static inline bool
+check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
+{
+	/*
+	 * It's possible that cpuset's mems_allowed and the nodemask from
+	 * mempolicy don't intersect. This should be normally dealt with by
+	 * policy_nodemask(), but it's possible to race with cpuset update in
+	 * such a way the check therein was true, and then it became false
+	 * before we got our cpuset_mems_cookie here.
+	 * This assumes that for all allocations, ac->nodemask can come only
+	 * from MPOL_BIND mempolicy (whose documented semantics is to be ignored
+	 * when it does not intersect with the cpuset restrictions) or the
+	 * caller can deal with a violated nodemask.
+	 */
+	if (cpusets_enabled() && ac->nodemask &&
+			!cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
+		ac->nodemask = NULL;
+		return true;
+	}
+
+	/*
+	 * When updating a task's mems_allowed or mempolicy nodemask, it is
+	 * possible to race with parallel threads in such a way that our
+	 * allocation can fail while the mask is being updated. If we are about
+	 * to fail, check if the cpuset changed during allocation and if so,
+	 * retry.
+	 */
+	if (read_mems_allowed_retry(cpuset_mems_cookie))
+		return true;
+
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -3856,11 +3889,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				&compaction_retries))
 		goto retry;
 
-	/*
-	 * It's possible we raced with cpuset update so the OOM would be
-	 * premature (see below the nopage: label for full explanation).
-	 */
-	if (read_mems_allowed_retry(cpuset_mems_cookie))
+
+	/* Deal with possible cpuset update races before we start OOM killing */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
@@ -3879,14 +3910,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 nopage:
-	/*
-	 * When updating a task's mems_allowed or mempolicy nodemask, it is
-	 * possible to race with parallel threads in such a way that our
-	 * allocation can fail while the mask is being updated. If we are about
-	 * to fail, check if the cpuset changed during allocation and if so,
-	 * retry.
-	 */
-	if (read_mems_allowed_retry(cpuset_mems_cookie))
+	/* Deal with possible cpuset update races before we fail */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac))
 		goto retry_cpuset;
 
 	/*
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
  2017-05-17  8:11 ` [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
@ 2017-05-17  8:11 ` Vlastimil Babka
  2017-05-17 15:07   ` Christoph Lameter
  2017-05-17  8:11 ` [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-17  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, linux-kernel, cgroups, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov, Vlastimil Babka

The task->il_next variable stores the next allocation node id for task's
MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
bind mempolicies due to changing cpuset mems. Currently it also tries to
make sure that current->il_next is valid within the updated nodemask. This is
bogus, because 1) we are updating potentially any task's mempolicy, not just
current's, and 2) we might be updating a per-vma mempolicy, not a per-task one.

The interleave_nodes() function that uses il_next can cope fine with the value
not being within the currently allowed nodes, so this hasn't manifested as an
actual issue.

We can remove the need for updating il_next completely by changing it to
il_prev and storing the node id of the previous interleave allocation instead
of the next id. Then interleave_nodes() can calculate the next id using the
current nodemask and also store it as il_prev; do_get_mempolicy() only queries
the next node without storing it.
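
A small userspace model of the new scheme (an illustration only; next_node_in()
is modelled here as "next set bit after n, wrapping around", and MAX_NUMNODES is
shrunk to 8) shows why il_prev == MAX_NUMNODES-1 makes the first interleaved
allocation pick the first node of the mask:

#include <stdio.h>

#define MAX_NUMNODES 8	/* toy value; the real one is config-dependent */

/* Model of next_node_in(): the next set bit after n, wrapping around;
 * returns MAX_NUMNODES for an empty mask. */
static int next_node_in(int n, unsigned int mask)
{
	for (int i = 1; i <= MAX_NUMNODES; i++) {
		int node = (n + i) % MAX_NUMNODES;

		if (mask & (1u << node))
			return node;
	}
	return MAX_NUMNODES;
}

int main(void)
{
	unsigned int nodes = (1u << 1) | (1u << 3);	/* interleave over {1,3} */
	int il_prev = MAX_NUMNODES - 1;			/* as set by do_set_mempolicy() */

	for (int i = 0; i < 4; i++) {
		/* this is what interleave_nodes() now does */
		int next = next_node_in(il_prev, nodes);

		printf("allocation %d -> node %d\n", i, next);
		if (next < MAX_NUMNODES)
			il_prev = next;
	}
	return 0;	/* prints nodes 1, 3, 1, 3 */
}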

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/sched.h |  2 +-
 mm/mempolicy.c        | 22 +++++++---------------
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6a97386c785e..b72bbfec01f9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -884,7 +884,7 @@ struct task_struct {
 #ifdef CONFIG_NUMA
 	/* Protected by alloc_lock: */
 	struct mempolicy		*mempolicy;
-	short				il_next;
+	short				il_prev;
 	short				pref_node_fork;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 37d0b334bfe9..d77177c7283b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -349,12 +349,6 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
 		pol->v.nodes = tmp;
 	else
 		BUG();
-
-	if (!node_isset(current->il_next, tmp)) {
-		current->il_next = next_node_in(current->il_next, tmp);
-		if (current->il_next >= MAX_NUMNODES)
-			current->il_next = numa_node_id();
-	}
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -812,9 +806,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 	}
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE &&
-	    nodes_weight(new->v.nodes))
-		current->il_next = first_node(new->v.nodes);
+	if (new && new->mode == MPOL_INTERLEAVE)
+		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -916,7 +909,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			*policy = err;
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
-			*policy = current->il_next;
+			*policy = next_node_in(current->il_prev, pol->v.nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1697,14 +1690,13 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
-	unsigned nid, next;
+	unsigned next;
 	struct task_struct *me = current;
 
-	nid = me->il_next;
-	next = next_node_in(nid, policy->v.nodes);
+	next = next_node_in(me->il_prev, policy->v.nodes);
 	if (next < MAX_NUMNODES)
-		me->il_next = next;
-	return nid;
+		me->il_prev = next;
+	return next;
 }
 
 /*
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator
  2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
  2017-05-17  8:11 ` [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
  2017-05-17  8:11 ` [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() Vlastimil Babka
@ 2017-05-17  8:11 ` Vlastimil Babka
  2017-05-17 15:19   ` Christoph Lameter
  2017-05-19 11:59   ` Michal Hocko
  2017-05-17  8:11 ` [PATCH v2 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets Vlastimil Babka
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-17  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, linux-kernel, cgroups, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov, Vlastimil Babka

The main allocator function __alloc_pages_nodemask() takes a zonelist pointer
as one of its parameters. All of its callers directly or indirectly obtain the
zonelist via node_zonelist() using a preferred node id and gfp_mask. We can
make the code a bit simpler by doing the zonelist lookup in
__alloc_pages_nodemask(), passing it a preferred node id instead (gfp_mask is
already another parameter).

There are some code size benefits thanks to removal of inlined node_zonelist():

bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)

This will also make things simpler if we proceed with converting cpusets to
zonelists.
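
For context, node_zonelist() is roughly the following small inline (an
approximate paraphrase of include/linux/gfp.h from this era, for illustration
only), which is why dropping the inlined lookups from the call sites gives the
code size win quoted above:

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	/* picks the node's fallback or __GFP_THISNODE-only zonelist */
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}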

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h       | 11 +++++------
 include/linux/mempolicy.h |  6 +++---
 mm/hugetlb.c              | 15 +++++++++------
 mm/memory_hotplug.c       |  6 ++----
 mm/mempolicy.c            | 41 +++++++++++++++++++----------------------
 mm/page_alloc.c           | 10 +++++-----
 6 files changed, 43 insertions(+), 46 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 2b1a44f5bdb6..666af3c39d00 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -432,14 +432,13 @@ static inline void arch_alloc_page(struct page *page, int order) { }
 #endif
 
 struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-		       struct zonelist *zonelist, nodemask_t *nodemask);
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
+							nodemask_t *nodemask);
 
 static inline struct page *
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
 {
-	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
 }
 
 /*
@@ -452,7 +451,7 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
 	VM_WARN_ON(!node_online(nid));
 
-	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+	return __alloc_pages(gfp_mask, order, nid);
 }
 
 /*
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f4d8281832b..ecb6cbeede5a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -146,7 +146,7 @@ extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
 				enum mpol_rebind_step step);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 
-extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+extern int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
@@ -269,13 +269,13 @@ static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 {
 }
 
-static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
+static inline int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask)
 {
 	*mpol = NULL;
 	*nodemask = NULL;
-	return node_zonelist(0, gfp_flags);
+	return 0;
 }
 
 static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e5828875f7bb..9f1f399bb913 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -904,6 +904,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	struct page *page = NULL;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
+	gfp_t gfp_mask;
+	int nid;
 	struct zonelist *zonelist;
 	struct zone *zone;
 	struct zoneref *z;
@@ -924,12 +926,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	zonelist = huge_zonelist(vma, address,
-					htlb_alloc_mask(h), &mpol, &nodemask);
+	gfp_mask = htlb_alloc_mask(h);
+	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+	zonelist = node_zonelist(nid, gfp_mask);
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 						MAX_NR_ZONES - 1, nodemask) {
-		if (cpuset_zone_allowed(zone, htlb_alloc_mask(h))) {
+		if (cpuset_zone_allowed(zone, gfp_mask)) {
 			page = dequeue_huge_page_node(h, zone_to_nid(zone));
 			if (page) {
 				if (avoid_reserve)
@@ -1545,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 	do {
 		struct page *page;
 		struct mempolicy *mpol;
-		struct zonelist *zl;
+		int nid;
 		nodemask_t *nodemask;
 
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
+		nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
 		mpol_cond_put(mpol);
-		page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
+		page = __alloc_pages_nodemask(gfp, order, nid, nodemask);
 		if (page)
 			return page;
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 717c5e301aa8..ba9e09817f37 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1596,11 +1596,9 @@ static struct page *new_node_page(struct page *page, unsigned long private,
 		gfp_mask |= __GFP_HIGHMEM;
 
 	if (!nodes_empty(nmask))
-		new_page = __alloc_pages_nodemask(gfp_mask, 0,
-					node_zonelist(nid, gfp_mask), &nmask);
+		new_page = __alloc_pages_nodemask(gfp_mask, 0, nid, &nmask);
 	if (!new_page)
-		new_page = __alloc_pages(gfp_mask, 0,
-					node_zonelist(nid, gfp_mask));
+		new_page = __alloc_pages(gfp_mask, 0, nid);
 
 	return new_page;
 }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d77177c7283b..c60807625fd5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1669,9 +1669,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	return NULL;
 }
 
-/* Return a zonelist indicated by gfp for node representing a mempolicy */
-static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
-	int nd)
+/* Return the node id preferred by the given mempolicy, or the given id */
+static int policy_node(gfp_t gfp, struct mempolicy *policy,
+								int nd)
 {
 	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
 		nd = policy->v.preferred_node;
@@ -1684,7 +1684,7 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
-	return node_zonelist(nd, gfp);
+	return nd;
 }
 
 /* Do dynamic interleaving for a process */
@@ -1791,38 +1791,37 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
 
 #ifdef CONFIG_HUGETLBFS
 /*
- * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
+ * huge_node(@vma, @addr, @gfp_flags, @mpol)
  * @vma: virtual memory area whose policy is sought
  * @addr: address in @vma for shared policy lookup and interleave policy
  * @gfp_flags: for requested zone
  * @mpol: pointer to mempolicy pointer for reference counted mempolicy
  * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
  *
- * Returns a zonelist suitable for a huge page allocation and a pointer
+ * Returns a nid suitable for a huge page allocation and a pointer
  * to the struct mempolicy for conditional unref after allocation.
  * If the effective policy is 'BIND, returns a pointer to the mempolicy's
  * @nodemask for filtering the zonelist.
  *
  * Must be protected by read_mems_allowed_begin()
  */
-struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
-				gfp_t gfp_flags, struct mempolicy **mpol,
-				nodemask_t **nodemask)
+int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
+				struct mempolicy **mpol, nodemask_t **nodemask)
 {
-	struct zonelist *zl;
+	int nid;
 
 	*mpol = get_vma_policy(vma, addr);
 	*nodemask = NULL;	/* assume !MPOL_BIND */
 
 	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
-		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
-				huge_page_shift(hstate_vma(vma))), gfp_flags);
+		nid = interleave_nid(*mpol, vma, addr,
+					huge_page_shift(hstate_vma(vma)));
 	} else {
-		zl = policy_zonelist(gfp_flags, *mpol, numa_node_id());
+		nid = policy_node(gfp_flags, *mpol, numa_node_id());
 		if ((*mpol)->mode == MPOL_BIND)
 			*nodemask = &(*mpol)->v.nodes;
 	}
-	return zl;
+	return nid;
 }
 
 /*
@@ -1924,12 +1923,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 					unsigned nid)
 {
-	struct zonelist *zl;
 	struct page *page;
 
-	zl = node_zonelist(nid, gfp);
-	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
+	page = __alloc_pages(gfp, order, nid);
+	if (page && page_to_nid(page) == nid)
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
 	return page;
 }
@@ -1963,8 +1960,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 {
 	struct mempolicy *pol;
 	struct page *page;
+	int preferred_nid;
 	unsigned int cpuset_mems_cookie;
-	struct zonelist *zl;
 	nodemask_t *nmask;
 
 retry_cpuset:
@@ -2007,8 +2004,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	}
 
 	nmask = policy_nodemask(gfp, pol);
-	zl = policy_zonelist(gfp, pol, node);
-	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
+	preferred_nid = policy_node(gfp, pol, node);
+	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
 	mpol_cond_put(pol);
 out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
@@ -2055,7 +2052,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	else
 		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
+				policy_node(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 43aa767c3188..0aceca1076dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3962,12 +3962,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 }
 
 static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, nodemask_t *nodemask,
+		int preferred_nid, nodemask_t *nodemask,
 		struct alloc_context *ac, gfp_t *alloc_mask,
 		unsigned int *alloc_flags)
 {
 	ac->high_zoneidx = gfp_zone(gfp_mask);
-	ac->zonelist = zonelist;
+	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
 	ac->migratetype = gfpflags_to_migratetype(gfp_mask);
 
@@ -4012,8 +4012,8 @@ static inline void finalise_ac(gfp_t gfp_mask,
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
-			struct zonelist *zonelist, nodemask_t *nodemask)
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
+							nodemask_t *nodemask)
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
@@ -4021,7 +4021,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct alloc_context ac = { };
 
 	gfp_mask &= gfp_allowed_mask;
-	if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags))
+	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
 
 	finalise_ac(gfp_mask, order, &ac);
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets
  2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
                   ` (2 preceding siblings ...)
  2017-05-17  8:11 ` [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
@ 2017-05-17  8:11 ` Vlastimil Babka
  2017-05-23  7:11   ` Michal Hocko
  2017-05-17  8:11 ` [PATCH v2 5/6] mm, cpuset: always use seqlock when changing task's nodemask Vlastimil Babka
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-17  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, linux-kernel, cgroups, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov, Vlastimil Babka

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") has introduced a two-step protocol when rebinding a task's
mempolicy due to a cpuset update, in order to avoid a parallel allocation seeing
an empty effective nodemask and failing. Later, commit cc9a6c877661 ("cpuset:
mm: reduce large amounts of memory barrier related damage v3") introduced
a seqlock protection and removed the synchronization point between the two
update steps. At that point (or perhaps later), the two-step rebinding became
unnecessary. Currently it only makes sure that the update first adds new nodes
in step 1 and then removes nodes in step 2. Without memory barriers the effects
are questionable, and even then this cannot prevent a parallel zonelist
iteration (which checks the nodemask at each step) from observing all nodes as
unusable for allocation. We now fully rely on the seqlock to prevent premature
OOMs and allocation failures.

We can thus remove the two-step update parts and simplify the code.
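
For the remaining (non-static, non-relative) case, the rebind now boils down to
a single nodes_remap() from the old cpuset mems to the new ones. A small
userspace model of that remap, covering only the common case where the policy's
nodes are a subset of the old mems (an assumption of this sketch; the kernel
helper handles more corner cases):

#include <stdio.h>

#define NBITS 8

/* Model of the subset case of nodes_remap(): the n-th set bit of 'old' maps to
 * the n-th set bit of 'new'; the bits of 'src' are translated through that map. */
static unsigned int remap(unsigned int src, unsigned int old, unsigned int new)
{
	unsigned int dst = 0;
	int ord = 0;

	for (int bit = 0; bit < NBITS; bit++) {
		if (!(old & (1u << bit)))
			continue;
		if (src & (1u << bit)) {
			int seen = 0;

			/* find the ord-th set bit of 'new' */
			for (int nb = 0; nb < NBITS; nb++) {
				if (new & (1u << nb)) {
					if (seen == ord) {
						dst |= 1u << nb;
						break;
					}
					seen++;
				}
			}
		}
		ord++;
	}
	return dst;
}

int main(void)
{
	unsigned int old_mems = 0x0f;	/* old cpuset mems: {0,1,2,3} */
	unsigned int new_mems = 0xf0;	/* new cpuset mems: {4,5,6,7} */
	unsigned int pol_nodes = 0x06;	/* policy interleaves over {1,2} */

	/* prints 0x60, i.e. the policy now covers {5,6} */
	printf("remapped policy nodes: 0x%02x\n",
	       remap(pol_nodes, old_mems, new_mems));
	return 0;
}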

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mempolicy.h      |   6 +--
 include/uapi/linux/mempolicy.h |   8 ----
 kernel/cgroup/cpuset.c         |   4 +-
 mm/mempolicy.c                 | 102 ++++++++---------------------------------
 4 files changed, 21 insertions(+), 99 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ecb6cbeede5a..3a58b4be1b0c 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -142,8 +142,7 @@ bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
-extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
-				enum mpol_rebind_step step);
+extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 
 extern int huge_node(struct vm_area_struct *vma,
@@ -260,8 +259,7 @@ static inline void numa_default_policy(void)
 }
 
 static inline void mpol_rebind_task(struct task_struct *tsk,
-				const nodemask_t *new,
-				enum mpol_rebind_step step)
+				const nodemask_t *new)
 {
 }
 
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 9cd8b21dddbe..2a4d89508fec 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -24,13 +24,6 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };
 
-enum mpol_rebind_step {
-	MPOL_REBIND_ONCE,	/* do rebind work at once(not by two step) */
-	MPOL_REBIND_STEP1,	/* first step(set all the newly nodes) */
-	MPOL_REBIND_STEP2,	/* second step(clean all the disallowed nodes)*/
-	MPOL_REBIND_NSTEP,
-};
-
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)
@@ -65,7 +58,6 @@ enum mpol_rebind_step {
  */
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
-#define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
 
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0f41292be0fb..dfd5b420452d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1063,9 +1063,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
 	}
 
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
-	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
-
-	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
+	mpol_rebind_task(tsk, newmems);
 	tsk->mems_allowed = *newmems;
 
 	if (need_loop) {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c60807625fd5..047181452040 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -146,22 +146,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
 
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
-	/*
-	 * If read-side task has no lock to protect task->mempolicy, write-side
-	 * task will rebind the task->mempolicy by two step. The first step is
-	 * setting all the newly nodes, and the second step is cleaning all the
-	 * disallowed nodes. In this way, we can avoid finding no node to alloc
-	 * page.
-	 * If we have a lock to protect task->mempolicy in read-side, we do
-	 * rebind directly.
-	 *
-	 * step:
-	 * 	MPOL_REBIND_ONCE - do rebind work at once
-	 * 	MPOL_REBIND_STEP1 - set all the newly nodes
-	 * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
-	 */
-	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes,
-			enum mpol_rebind_step step);
+	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
 } mpol_ops[MPOL_MAX];
 
 static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
@@ -304,19 +289,11 @@ void __mpol_put(struct mempolicy *p)
 	kmem_cache_free(policy_cache, p);
 }
 
-static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes,
-				enum mpol_rebind_step step)
+static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
 {
 }
 
-/*
- * step:
- * 	MPOL_REBIND_ONCE  - do rebind work at once
- * 	MPOL_REBIND_STEP1 - set all the newly nodes
- * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
- */
-static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
-				 enum mpol_rebind_step step)
+static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	nodemask_t tmp;
 
@@ -325,35 +302,19 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
 	else if (pol->flags & MPOL_F_RELATIVE_NODES)
 		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
 	else {
-		/*
-		 * if step == 1, we use ->w.cpuset_mems_allowed to cache the
-		 * result
-		 */
-		if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP1) {
-			nodes_remap(tmp, pol->v.nodes,
-					pol->w.cpuset_mems_allowed, *nodes);
-			pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
-		} else if (step == MPOL_REBIND_STEP2) {
-			tmp = pol->w.cpuset_mems_allowed;
-			pol->w.cpuset_mems_allowed = *nodes;
-		} else
-			BUG();
+		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
+								*nodes);
+		pol->w.cpuset_mems_allowed = tmp;
 	}
 
 	if (nodes_empty(tmp))
 		tmp = *nodes;
 
-	if (step == MPOL_REBIND_STEP1)
-		nodes_or(pol->v.nodes, pol->v.nodes, tmp);
-	else if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP2)
-		pol->v.nodes = tmp;
-	else
-		BUG();
+	pol->v.nodes = tmp;
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
-				  const nodemask_t *nodes,
-				  enum mpol_rebind_step step)
+						const nodemask_t *nodes)
 {
 	nodemask_t tmp;
 
@@ -379,42 +340,19 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
 /*
  * mpol_rebind_policy - Migrate a policy to a different set of nodes
  *
- * If read-side task has no lock to protect task->mempolicy, write-side
- * task will rebind the task->mempolicy by two step. The first step is
- * setting all the newly nodes, and the second step is cleaning all the
- * disallowed nodes. In this way, we can avoid finding no node to alloc
- * page.
- * If we have a lock to protect task->mempolicy in read-side, we do
- * rebind directly.
- *
- * step:
- * 	MPOL_REBIND_ONCE  - do rebind work at once
- * 	MPOL_REBIND_STEP1 - set all the newly nodes
- * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
+ * Per-vma policies are protected by mmap_sem. Allocations using per-task
+ * policies are protected by task->mems_allowed_seq to prevent a premature
+ * OOM/allocation failure due to parallel nodemask modification.
  */
-static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
-				enum mpol_rebind_step step)
+static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
 {
 	if (!pol)
 		return;
-	if (!mpol_store_user_nodemask(pol) && step == MPOL_REBIND_ONCE &&
+	if (!mpol_store_user_nodemask(pol) &&
 	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
 		return;
 
-	if (step == MPOL_REBIND_STEP1 && (pol->flags & MPOL_F_REBINDING))
-		return;
-
-	if (step == MPOL_REBIND_STEP2 && !(pol->flags & MPOL_F_REBINDING))
-		BUG();
-
-	if (step == MPOL_REBIND_STEP1)
-		pol->flags |= MPOL_F_REBINDING;
-	else if (step == MPOL_REBIND_STEP2)
-		pol->flags &= ~MPOL_F_REBINDING;
-	else if (step >= MPOL_REBIND_NSTEP)
-		BUG();
-
-	mpol_ops[pol->mode].rebind(pol, newmask, step);
+	mpol_ops[pol->mode].rebind(pol, newmask);
 }
 
 /*
@@ -424,10 +362,9 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
  * Called with task's alloc_lock held.
  */
 
-void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
-			enum mpol_rebind_step step)
+void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
 {
-	mpol_rebind_policy(tsk->mempolicy, new, step);
+	mpol_rebind_policy(tsk->mempolicy, new);
 }
 
 /*
@@ -442,7 +379,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 
 	down_write(&mm->mmap_sem);
 	for (vma = mm->mmap; vma; vma = vma->vm_next)
-		mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
+		mpol_rebind_policy(vma->vm_policy, new);
 	up_write(&mm->mmap_sem);
 }
 
@@ -2101,10 +2038,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
 
 	if (current_cpuset_is_being_rebound()) {
 		nodemask_t mems = cpuset_mems_allowed(current);
-		if (new->flags & MPOL_F_REBINDING)
-			mpol_rebind_policy(new, &mems, MPOL_REBIND_STEP2);
-		else
-			mpol_rebind_policy(new, &mems, MPOL_REBIND_ONCE);
+		mpol_rebind_policy(new, &mems);
 	}
 	atomic_set(&new->refcnt, 1);
 	return new;
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 5/6] mm, cpuset: always use seqlock when changing task's nodemask
  2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
                   ` (3 preceding siblings ...)
  2017-05-17  8:11 ` [PATCH v2 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets Vlastimil Babka
@ 2017-05-17  8:11 ` Vlastimil Babka
  2017-05-23  7:15   ` Michal Hocko
  2017-05-17  8:11 ` [PATCH v2 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter Vlastimil Babka
  2017-05-17 16:24 ` [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Michal Hocko
  6 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-17  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, linux-kernel, cgroups, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov, Vlastimil Babka

When updating a task's mems_allowed and rebinding its mempolicy due to the
cpuset's mems being changed, we currently only take the seqlock for writing when
either the task has a mempolicy, or the new mems have no intersection with the
old mems. This should be enough to prevent a parallel allocation seeing no
available nodes, but the optimization is IMHO unnecessary (cpuset updates should
not be frequent), and we still potentially risk issues if the intersection of
the new and old nodes has only a limited amount of free/reclaimable memory.
Let's just use the seqlock for all tasks.
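
After this change the update side and the allocator's read side always pair up
as follows (a condensed sketch of code already in the tree and in this series,
not new logic):

/* writer: cpuset_change_task_nodemask(), as of this patch */
	task_lock(tsk);
	local_irq_disable();
	write_seqcount_begin(&tsk->mems_allowed_seq);

	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
	mpol_rebind_task(tsk, newmems);
	tsk->mems_allowed = *newmems;

	write_seqcount_end(&tsk->mems_allowed_seq);
	local_irq_enable();
	task_unlock(tsk);

/* reader: the page allocator (see patch 1) */
	cpuset_mems_cookie = read_mems_allowed_begin();
	/* ... try to allocate ... */
	if (!page && read_mems_allowed_retry(cpuset_mems_cookie))
		goto retry_cpuset;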

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/cgroup/cpuset.c | 29 ++++++++---------------------
 1 file changed, 8 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index dfd5b420452d..26a1c360a481 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1038,38 +1038,25 @@ static void cpuset_post_attach(void)
  * @tsk: the task to change
  * @newmems: new nodes that the task will be set
  *
- * In order to avoid seeing no nodes if the old and new nodes are disjoint,
- * we structure updates as setting all new allowed nodes, then clearing newly
- * disallowed ones.
+ * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
+ * and rebind an eventual tasks' mempolicy. If the task is allocating in
+ * parallel, it might temporarily see an empty intersection, which results in
+ * a seqlock check and retry before OOM or allocation failure.
  */
 static void cpuset_change_task_nodemask(struct task_struct *tsk,
 					nodemask_t *newmems)
 {
-	bool need_loop;
-
 	task_lock(tsk);
-	/*
-	 * Determine if a loop is necessary if another thread is doing
-	 * read_mems_allowed_begin().  If at least one node remains unchanged and
-	 * tsk does not have a mempolicy, then an empty nodemask will not be
-	 * possible when mems_allowed is larger than a word.
-	 */
-	need_loop = task_has_mempolicy(tsk) ||
-			!nodes_intersects(*newmems, tsk->mems_allowed);
 
-	if (need_loop) {
-		local_irq_disable();
-		write_seqcount_begin(&tsk->mems_allowed_seq);
-	}
+	local_irq_disable();
+	write_seqcount_begin(&tsk->mems_allowed_seq);
 
 	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
 	mpol_rebind_task(tsk, newmems);
 	tsk->mems_allowed = *newmems;
 
-	if (need_loop) {
-		write_seqcount_end(&tsk->mems_allowed_seq);
-		local_irq_enable();
-	}
+	write_seqcount_end(&tsk->mems_allowed_seq);
+	local_irq_enable();
 
 	task_unlock(tsk);
 }
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v2 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter
  2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
                   ` (4 preceding siblings ...)
  2017-05-17  8:11 ` [PATCH v2 5/6] mm, cpuset: always use seqlock when changing task's nodemask Vlastimil Babka
@ 2017-05-17  8:11 ` Vlastimil Babka
  2017-05-23  7:16   ` Michal Hocko
  2017-05-17 16:24 ` [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Michal Hocko
  6 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-17  8:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-api, linux-kernel, cgroups, Li Zefan,
	Michal Hocko, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov, Vlastimil Babka

Two wrappers of __alloc_pages_nodemask() are checking task->mems_allowed_seq
themselves to retry an allocation that has raced with a cpuset update. This has
been shown to be ineffective in preventing premature OOMs, which can happen in
__alloc_pages_slowpath() long before it returns back to the wrappers to detect
the race at that level. Previous patches have made __alloc_pages_slowpath()
more robust, so we can now simply remove the seqlock checking in the wrappers,
to avoid giving the further wrong impression that it can actually help.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mempolicy.c | 16 ----------------
 1 file changed, 16 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 047181452040..7d8e56214ac0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1898,12 +1898,9 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	struct mempolicy *pol;
 	struct page *page;
 	int preferred_nid;
-	unsigned int cpuset_mems_cookie;
 	nodemask_t *nmask;
 
-retry_cpuset:
 	pol = get_vma_policy(vma, addr);
-	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	if (pol->mode == MPOL_INTERLEAVE) {
 		unsigned nid;
@@ -1945,8 +1942,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
 	mpol_cond_put(pol);
 out:
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-		goto retry_cpuset;
 	return page;
 }
 
@@ -1964,23 +1959,15 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
  *	Allocate a page from the kernel page pool.  When not in
  *	interrupt context and apply the current process NUMA policy.
  *	Returns NULL when no page can be allocated.
- *
- *	Don't call cpuset_update_task_memory_state() unless
- *	1) it's ok to take cpuset_sem (can WAIT), and
- *	2) allocating for current task (not interrupt).
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = &default_policy;
 	struct page *page;
-	unsigned int cpuset_mems_cookie;
 
 	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
 		pol = get_task_policy(current);
 
-retry_cpuset:
-	cpuset_mems_cookie = read_mems_allowed_begin();
-
 	/*
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
@@ -1992,9 +1979,6 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 				policy_node(gfp, pol, numa_node_id()),
 				policy_nodemask(gfp, pol));
 
-	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
-		goto retry_cpuset;
-
 	return page;
 }
 EXPORT_SYMBOL(alloc_pages_current);
-- 
2.12.2

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask()
  2017-05-17  8:11 ` [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() Vlastimil Babka
@ 2017-05-17 15:07   ` Christoph Lameter
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2017-05-17 15:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Michal Hocko, Mel Gorman, David Rientjes, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov

On Wed, 17 May 2017, Vlastimil Babka wrote:

> The task->il_next variable stores the next allocation node id for task's
> MPOL_INTERLEAVE policy. mpol_rebind_nodemask() updates interleave and
> bind mempolicies due to changing cpuset mems. Currently it also tries to
> make sure that current->il_next is valid within the updated nodemask. This is
> bogus, because 1) we are updating potentially any task's mempolicy, not just
> current, and 2) we might be updating a per-vma mempolicy, not task one.

Reviewed-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator
  2017-05-17  8:11 ` [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
@ 2017-05-17 15:19   ` Christoph Lameter
  2017-05-18 10:25     ` Vlastimil Babka
  2017-05-19 11:59   ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2017-05-17 15:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Michal Hocko, Mel Gorman, David Rientjes, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Dimitri Sivanich

On Wed, 17 May 2017, Vlastimil Babka wrote:

>  struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> -		       struct zonelist *zonelist, nodemask_t *nodemask);
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> +							nodemask_t *nodemask);
>
>  static inline struct page *
> -__alloc_pages(gfp_t gfp_mask, unsigned int order,
> -		struct zonelist *zonelist)
> +__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
>  {
> -	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
> +	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
>  }

Maybe use nid instead of preferred_nid like in __alloc_pages? Otherwise
there may be confusion with the MPOL_PREFERRED policy.

> @@ -1963,8 +1960,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  {
>  	struct mempolicy *pol;
>  	struct page *page;
> +	int preferred_nid;
>  	unsigned int cpuset_mems_cookie;
> -	struct zonelist *zl;
>  	nodemask_t *nmask;

Same here.

> @@ -4012,8 +4012,8 @@ static inline void finalise_ac(gfp_t gfp_mask,
>   * This is the 'heart' of the zoned buddy allocator.
>   */
>  struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> -			struct zonelist *zonelist, nodemask_t *nodemask)
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> +							nodemask_t *nodemask)
>  {

and here

This looks clean to me. Still feel a bit uneasy about this since I do
remember that we had a reason to use zonelists instead of nodes back then
but cannot remember what that reason was....

CCing Dimitri at SGI. This may break a lot of legacy SGI apps. If you read
this, Dimitri, then please review this patchset and the discussions around
it.

Reviewed-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups
  2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
                   ` (5 preceding siblings ...)
  2017-05-17  8:11 ` [PATCH v2 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter Vlastimil Babka
@ 2017-05-17 16:24 ` Michal Hocko
  6 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-05-17 16:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov

On Wed 17-05-17 10:11:34, Vlastimil Babka wrote:
> Changes since RFC v1 [3]:
> 
> - Reworked patch 2 after discussion with Christoph Lameter.
> - Fix bug in patch 5 spotted by Hillf Danton.
> - Rebased to mmotm-2017-05-12-15-53
> 
> I would like to stress that this patchset aims to fix issues and cleanup the
> code *within the existing documented semantics*, i.e. patch 1 ignores mempolicy
> restrictions if the set of allowed nodes has no intersection with set of nodes
> allowed by cpuset. I believe discussing potential changes of the semantics can
> be better done once we have a baseline with no known bugs of the current
> semantics.
> 
> ===
> 
> I've recently summarized the cpuset/mempolicy issues in a LSF/MM proposal [1]
> and the discussion itself [2]. I've been trying to rewrite the handling as
> proposed, with the idea that changing semantics to make all mempolicies static
> wrt cpuset updates (and discarding the relative and default modes) can be tried
> on top, as there's a high risk of being rejected/reverted because somebody
> might still care about the removed modes.
> 
> However I haven't yet figured out how to properly:
> 
> 1) make mempolicies swappable instead of rebinding in place. I thought mbind()
> already works that way and uses refcounting to avoid use-after-free of the old
> policy by a parallel allocation, but turns out true refcounting is only done
> for shared (shmem) mempolicies, and the actual protection for mbind() comes
> from mmap_sem. Extending the refcounting means more overhead in allocator hot
> path.

But that overhead would be there only if any policy is in place, right?
Do you think it would be possible to use per-cpu ref counting and thus
reduce the overhead? We use this trick in the css ref counting in the
memcg and the overhead is not measurable. It would certainly be better
to use the same mechanism for shared and anon mappings whenever
possible.
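
For readers who haven't seen it, the mechanism Michal refers to is the
percpu_ref API (the one backing css refcounting). A purely hypothetical sketch
of how a mempolicy could use it; the struct, field and callback names below are
made up for illustration and are not from any posted patch:

#include <linux/percpu-refcount.h>
#include <linux/slab.h>

/* Hypothetical example only. */
struct mempolicy_pcpu {
	struct percpu_ref	ref;
	/* ... mode, flags, nodemask ... */
};

static void mpol_pcpu_release(struct percpu_ref *ref)
{
	struct mempolicy_pcpu *pol = container_of(ref, struct mempolicy_pcpu, ref);

	kfree(pol);
}

/*
 * Setup:    percpu_ref_init(&pol->ref, mpol_pcpu_release, 0, GFP_KERNEL);
 * Hot path: percpu_ref_get(&pol->ref) / percpu_ref_put(&pol->ref) are per-cpu
 *           increments while the ref is live, so they would not bounce a
 *           shared cacheline in the allocator fast path.
 * Swap:     publish the new policy, then percpu_ref_kill(&old->ref); the
 *           release callback runs once the last reader drops its reference.
 */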

> Also swapping whole mempolicies means that we have to allocate the new
> ones, which can fail, and reverting of the partially done work also means
> allocating (note that mbind() doesn't care and will just leave part of the
> range updated and part not updated when returning -ENOMEM...).

This is nasty, but we already have to deal with half-way initialization,
so why can't we use the same approach? I am sorry if I am asking something
trivial, but I have already flushed all the details from memory and have
to re-read it again.

> 2) make cpuset's task->mems_allowed also swappable (after converting it from
> nodemask to zonelist, which is the easy part) for mostly the same reasons.
> 
> The good news is that while trying to do the above, I've at least figured out
> how to hopefully close the remaining premature OOMs, and do a bunch of cleanups
> on top, removing quite some of the code that was also supposed to prevent the
> cpuset update races, but no longer works. This should fix the
> most pressing concerns with this topic and give us a better baseline before
> either proceeding with the original proposal, or pushing a change of semantics
> that removes the problem 1) above. I'd be then fine with trying to change the
> semantic first and rewrite later.

the diffstat definitely looks promising. I will have a look tomorrow
(unless something unexpected jumps in again).

> Patchset is based on next-20170411 and has been tested with the LTP cpuset01
> stress test.
> 
> [1] https://lkml.kernel.org/r/4c44a589-5fd8-08d0-892c-e893bb525b71@suse.cz
> [2] https://lwn.net/Articles/717797/
> [3] https://marc.info/?l=linux-mm&m=149191957922828&w=2
> 
> Vlastimil Babka (6):
>   mm, page_alloc: fix more premature OOM due to race with cpuset update
>   mm, mempolicy: stop adjusting current->il_next in
>     mpol_rebind_nodemask()
>   mm, page_alloc: pass preferred nid instead of zonelist to allocator
>   mm, mempolicy: simplify rebinding mempolicies when updating cpusets
>   mm, cpuset: always use seqlock when changing task's nodemask
>   mm, mempolicy: don't check cpuset seqlock where it doesn't matter
> 
>  include/linux/gfp.h            |  11 ++-
>  include/linux/mempolicy.h      |  12 ++-
>  include/linux/sched.h          |   2 +-
>  include/uapi/linux/mempolicy.h |   8 --
>  kernel/cgroup/cpuset.c         |  33 ++------
>  mm/hugetlb.c                   |  15 ++--
>  mm/memory_hotplug.c            |   6 +-
>  mm/mempolicy.c                 | 181 ++++++++++-------------------------------
>  mm/page_alloc.c                |  61 ++++++++++----
>  9 files changed, 118 insertions(+), 211 deletions(-)
> 
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator
  2017-05-17 15:19   ` Christoph Lameter
@ 2017-05-18 10:25     ` Vlastimil Babka
  0 siblings, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-18 10:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Michal Hocko, Mel Gorman, David Rientjes, Hugh Dickins,
	Andrea Arcangeli, Anshuman Khandual, Kirill A. Shutemov,
	Dimitri Sivanich

On 05/17/2017 05:19 PM, Christoph Lameter wrote:
> On Wed, 17 May 2017, Vlastimil Babka wrote:
> 
>>  struct page *
>> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> -		       struct zonelist *zonelist, nodemask_t *nodemask);
>> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>> +							nodemask_t *nodemask);
>>
>>  static inline struct page *
>> -__alloc_pages(gfp_t gfp_mask, unsigned int order,
>> -		struct zonelist *zonelist)
>> +__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
>>  {
>> -	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
>> +	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
>>  }
> 
> Maybe use nid instead of preferred_nid like in __alloc_pages? Otherwise
> there may be confusion with the MPOL_PREFER policy.

I'll think about that.

>> @@ -1963,8 +1960,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>>  {
>>  	struct mempolicy *pol;
>>  	struct page *page;
>> +	int preferred_nid;
>>  	unsigned int cpuset_mems_cookie;
>> -	struct zonelist *zl;
>>  	nodemask_t *nmask;
> 
> Same here.
> 
>> @@ -4012,8 +4012,8 @@ static inline void finalise_ac(gfp_t gfp_mask,
>>   * This is the 'heart' of the zoned buddy allocator.
>>   */
>>  struct page *
>> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>> -			struct zonelist *zonelist, nodemask_t *nodemask)
>> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>> +							nodemask_t *nodemask)
>>  {
> 
> and here
> 
> This looks clean to me. Still feel a bit uneasy about this since I do
> remember that we had a reason to use zonelists instead of nodes back then
> but cannot remember what that reason was....

My history digging showed me that mempolicies used to have a custom
zonelist attached, not nodemask. So I supposed that's why.

> CCing Dimitri at SGI. This may break a lot of legacy SGIapps. If you read
> this Dimitri then please review this patchset and the discussions around
> it.

Break how? This shouldn't break any apps AFAICS, just out-of-tree kernel
patches/modules as usual when APIs change.

> Reviewed-by: Christoph Lameter <cl@linux.com>

Thanks!

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-17  8:11 ` [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
@ 2017-05-19 11:51   ` Michal Hocko
  2017-05-23 11:32     ` Vlastimil Babka
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2017-05-19 11:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov

On Wed 17-05-17 10:11:35, Vlastimil Babka wrote:
> Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset
> mems update") has fixed known recent regressions found by LTP's cpuset01
> testcase. I have however found that by modifying the testcase to use per-vma
> mempolicies via mbind(2) instead of per-task mempolicies via set_mempolicy(2),
> the premature OOM still happens and the issue is much older.
> 
> The root of the problem is that the cpuset's mems_allowed and mempolicy's
> nodemask can temporarily have no intersection, thus get_page_from_freelist()
> cannot find any usable zone. The current semantic for empty intersection is to
> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
> policy_nodemask(), but the racy update can happen after we already passed the
> check. Such races should be protected by the seqlock task->mems_allowed_seq,
> but it doesn't work here, because 1) mpol_rebind_mm() does not happen under
> seqlock for write, and doing so would lead to deadlock, as it takes mmap_sem
> for write, while the allocation can have mmap_sem for read when it's taking the
> seqlock for read. And 2) the seqlock cookie of callers of node_zonelist()
> (alloc_pages_vma() and alloc_pages_current()) is different than the one of
> __alloc_pages_slowpath(), so there's still a potential race window.
> 
> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
> policy_nodemask(). This works fine, because almost all callers of
> __alloc_pages_nodemask obtain the nodemask via policy_nodemask(). The only
> exception is new_node_page() from hotplug, where the potential violation of
> nodemask isn't an issue, as there's already a fallback allocation attempt
> without any nodemask. If there's a future caller that needs to have its specific
> nodemask honoured over task's cpuset restrictions, we'll have to e.g. add a gfp
> flag for that.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Do we want this backported to the stable tree?

OK I do agree this makes some sense as a quick and easy to backport
workaround.

Acked-by: Michal Hocko <mhocko@suse.com>
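
For illustration, here is a rough userspace sketch of the kind of race the
modified cpuset01 testcase exercises: one thread keeps faulting anonymous
memory under a per-vma MPOL_BIND policy while another thread flips the
cpuset's mems back and forth. This is not the LTP code; the cgroup-v1 path,
the node numbers and the mapping size are assumptions, the task is assumed
to already sit in that cpuset, and it needs root on a machine with at least
two memory nodes (link with -pthread -lnuma).

#include <numaif.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN	(64UL << 20)

/* assumed cgroup-v1 cpuset path; adjust for the local setup */
static const char *mems_file = "/sys/fs/cgroup/cpuset/test/cpuset.mems";

static void *flip_mems(void *arg)
{
	for (;;) {
		int node;

		for (node = 0; node < 2; node++) {
			FILE *f = fopen(mems_file, "w");

			if (!f)
				return NULL;
			fprintf(f, "%d", node);	/* alternate nodes 0 and 1 */
			fclose(f);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	unsigned long nodemask = 1UL << 1;	/* MPOL_BIND to node 1 */

	pthread_create(&t, NULL, flip_mems, NULL);

	for (;;) {
		char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* per-vma policy, i.e. the mbind(2) case the patch is about */
		mbind(p, LEN, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
		memset(p, 0, LEN);	/* fault the range while mems keep changing */
		munmap(p, LEN);
	}
}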

> ---
>  mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index beb2827fd5de..43aa767c3188 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3661,6 +3661,39 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  	return false;
>  }
>  
> +static inline bool
> +check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
> +{
> +	/*
> +	 * It's possible that cpuset's mems_allowed and the nodemask from
> +	 * mempolicy don't intersect. This should be normally dealt with by
> +	 * policy_nodemask(), but it's possible to race with cpuset update in
> +	 * such a way the check therein was true, and then it became false
> +	 * before we got our cpuset_mems_cookie here.
> +	 * This assumes that for all allocations, ac->nodemask can come only
> +	 * from MPOL_BIND mempolicy (whose documented semantics is to be ignored
> +	 * when it does not intersect with the cpuset restrictions) or the
> +	 * caller can deal with a violated nodemask.
> +	 */
> +	if (cpusets_enabled() && ac->nodemask &&
> +			!cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
> +		ac->nodemask = NULL;
> +		return true;
> +	}
> +
> +	/*
> +	 * When updating a task's mems_allowed or mempolicy nodemask, it is
> +	 * possible to race with parallel threads in such a way that our
> +	 * allocation can fail while the mask is being updated. If we are about
> +	 * to fail, check if the cpuset changed during allocation and if so,
> +	 * retry.
> +	 */
> +	if (read_mems_allowed_retry(cpuset_mems_cookie))
> +		return true;
> +
> +	return false;
> +}
> +
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  						struct alloc_context *ac)
> @@ -3856,11 +3889,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				&compaction_retries))
>  		goto retry;
>  
> -	/*
> -	 * It's possible we raced with cpuset update so the OOM would be
> -	 * premature (see below the nopage: label for full explanation).
> -	 */
> -	if (read_mems_allowed_retry(cpuset_mems_cookie))
> +
> +	/* Deal with possible cpuset update races before we start OOM killing */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
>  
>  	/* Reclaim has failed us, start killing things */
> @@ -3879,14 +3910,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  nopage:
> -	/*
> -	 * When updating a task's mems_allowed or mempolicy nodemask, it is
> -	 * possible to race with parallel threads in such a way that our
> -	 * allocation can fail while the mask is being updated. If we are about
> -	 * to fail, check if the cpuset changed during allocation and if so,
> -	 * retry.
> -	 */
> -	if (read_mems_allowed_retry(cpuset_mems_cookie))
> +	/* Deal with possible cpuset update races before we fail */
> +	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
>  
>  	/*
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator
  2017-05-17  8:11 ` [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
  2017-05-17 15:19   ` Christoph Lameter
@ 2017-05-19 11:59   ` Michal Hocko
  1 sibling, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-05-19 11:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov

On Wed 17-05-17 10:11:37, Vlastimil Babka wrote:
> The main allocator function __alloc_pages_nodemask() takes a zonelist pointer
> as one of its parameters. All of its callers directly or indirectly obtain the
> zonelist via node_zonelist() using a preferred node id and gfp_mask. We can
> make the code a bit simpler by doing the zonelist lookup in
> __alloc_pages_nodemask(), passing it a preferred node id instead (gfp_mask is
> already another parameter).
> 
> There are some code size benefits thanks to removal of inlined node_zonelist():
> 
> bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)
> 
> This will also make things simpler if we proceed with converting cpusets to
> zonelists.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Makes sense to me
Acked-by: Michal Hocko <mhocko@suse.com>
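
To make the interface change concrete, a condensed before/after sketch of a
caller (schematic only, lifted from the hunks quoted below rather than a
compilable unit):

/* before: every caller did the zonelist lookup itself */
page = __alloc_pages_nodemask(gfp_mask, order,
			      node_zonelist(nid, gfp_mask), nodemask);

/* after: callers just pass the preferred node id ... */
page = __alloc_pages_nodemask(gfp_mask, order, nid, nodemask);

/* ... and the lookup happens once, inside prepare_alloc_pages() */
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);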

> ---
>  include/linux/gfp.h       | 11 +++++------
>  include/linux/mempolicy.h |  6 +++---
>  mm/hugetlb.c              | 15 +++++++++------
>  mm/memory_hotplug.c       |  6 ++----
>  mm/mempolicy.c            | 41 +++++++++++++++++++----------------------
>  mm/page_alloc.c           | 10 +++++-----
>  6 files changed, 43 insertions(+), 46 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 2b1a44f5bdb6..666af3c39d00 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -432,14 +432,13 @@ static inline void arch_alloc_page(struct page *page, int order) { }
>  #endif
>  
>  struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> -		       struct zonelist *zonelist, nodemask_t *nodemask);
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> +							nodemask_t *nodemask);
>  
>  static inline struct page *
> -__alloc_pages(gfp_t gfp_mask, unsigned int order,
> -		struct zonelist *zonelist)
> +__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
>  {
> -	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
> +	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
>  }
>  
>  /*
> @@ -452,7 +451,7 @@ __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
>  	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>  	VM_WARN_ON(!node_online(nid));
>  
> -	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> +	return __alloc_pages(gfp_mask, order, nid);
>  }
>  
>  /*
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5f4d8281832b..ecb6cbeede5a 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -146,7 +146,7 @@ extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
>  				enum mpol_rebind_step step);
>  extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
>  
> -extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> +extern int huge_node(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
>  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> @@ -269,13 +269,13 @@ static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
>  {
>  }
>  
> -static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> +static inline int huge_node(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask)
>  {
>  	*mpol = NULL;
>  	*nodemask = NULL;
> -	return node_zonelist(0, gfp_flags);
> +	return 0;
>  }
>  
>  static inline bool init_nodemask_of_mempolicy(nodemask_t *m)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index e5828875f7bb..9f1f399bb913 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -904,6 +904,8 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  	struct page *page = NULL;
>  	struct mempolicy *mpol;
>  	nodemask_t *nodemask;
> +	gfp_t gfp_mask;
> +	int nid;
>  	struct zonelist *zonelist;
>  	struct zone *zone;
>  	struct zoneref *z;
> @@ -924,12 +926,13 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  
>  retry_cpuset:
>  	cpuset_mems_cookie = read_mems_allowed_begin();
> -	zonelist = huge_zonelist(vma, address,
> -					htlb_alloc_mask(h), &mpol, &nodemask);
> +	gfp_mask = htlb_alloc_mask(h);
> +	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> +	zonelist = node_zonelist(nid, gfp_mask);
>  
>  	for_each_zone_zonelist_nodemask(zone, z, zonelist,
>  						MAX_NR_ZONES - 1, nodemask) {
> -		if (cpuset_zone_allowed(zone, htlb_alloc_mask(h))) {
> +		if (cpuset_zone_allowed(zone, gfp_mask)) {
>  			page = dequeue_huge_page_node(h, zone_to_nid(zone));
>  			if (page) {
>  				if (avoid_reserve)
> @@ -1545,13 +1548,13 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
>  	do {
>  		struct page *page;
>  		struct mempolicy *mpol;
> -		struct zonelist *zl;
> +		int nid;
>  		nodemask_t *nodemask;
>  
>  		cpuset_mems_cookie = read_mems_allowed_begin();
> -		zl = huge_zonelist(vma, addr, gfp, &mpol, &nodemask);
> +		nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
>  		mpol_cond_put(mpol);
> -		page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
> +		page = __alloc_pages_nodemask(gfp, order, nid, nodemask);
>  		if (page)
>  			return page;
>  	} while (read_mems_allowed_retry(cpuset_mems_cookie));
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 717c5e301aa8..ba9e09817f37 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1596,11 +1596,9 @@ static struct page *new_node_page(struct page *page, unsigned long private,
>  		gfp_mask |= __GFP_HIGHMEM;
>  
>  	if (!nodes_empty(nmask))
> -		new_page = __alloc_pages_nodemask(gfp_mask, 0,
> -					node_zonelist(nid, gfp_mask), &nmask);
> +		new_page = __alloc_pages_nodemask(gfp_mask, 0, nid, &nmask);
>  	if (!new_page)
> -		new_page = __alloc_pages(gfp_mask, 0,
> -					node_zonelist(nid, gfp_mask));
> +		new_page = __alloc_pages(gfp_mask, 0, nid);
>  
>  	return new_page;
>  }
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d77177c7283b..c60807625fd5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1669,9 +1669,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  	return NULL;
>  }
>  
> -/* Return a zonelist indicated by gfp for node representing a mempolicy */
> -static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
> -	int nd)
> +/* Return the node id preferred by the given mempolicy, or the given id */
> +static int policy_node(gfp_t gfp, struct mempolicy *policy,
> +								int nd)
>  {
>  	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
>  		nd = policy->v.preferred_node;
> @@ -1684,7 +1684,7 @@ static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
>  		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
>  	}
>  
> -	return node_zonelist(nd, gfp);
> +	return nd;
>  }
>  
>  /* Do dynamic interleaving for a process */
> @@ -1791,38 +1791,37 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
>  
>  #ifdef CONFIG_HUGETLBFS
>  /*
> - * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
> + * huge_node(@vma, @addr, @gfp_flags, @mpol)
>   * @vma: virtual memory area whose policy is sought
>   * @addr: address in @vma for shared policy lookup and interleave policy
>   * @gfp_flags: for requested zone
>   * @mpol: pointer to mempolicy pointer for reference counted mempolicy
>   * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
>   *
> - * Returns a zonelist suitable for a huge page allocation and a pointer
> + * Returns a nid suitable for a huge page allocation and a pointer
>   * to the struct mempolicy for conditional unref after allocation.
>   * If the effective policy is 'BIND, returns a pointer to the mempolicy's
>   * @nodemask for filtering the zonelist.
>   *
>   * Must be protected by read_mems_allowed_begin()
>   */
> -struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> -				gfp_t gfp_flags, struct mempolicy **mpol,
> -				nodemask_t **nodemask)
> +int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
> +				struct mempolicy **mpol, nodemask_t **nodemask)
>  {
> -	struct zonelist *zl;
> +	int nid;
>  
>  	*mpol = get_vma_policy(vma, addr);
>  	*nodemask = NULL;	/* assume !MPOL_BIND */
>  
>  	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
> -		zl = node_zonelist(interleave_nid(*mpol, vma, addr,
> -				huge_page_shift(hstate_vma(vma))), gfp_flags);
> +		nid = interleave_nid(*mpol, vma, addr,
> +					huge_page_shift(hstate_vma(vma)));
>  	} else {
> -		zl = policy_zonelist(gfp_flags, *mpol, numa_node_id());
> +		nid = policy_node(gfp_flags, *mpol, numa_node_id());
>  		if ((*mpol)->mode == MPOL_BIND)
>  			*nodemask = &(*mpol)->v.nodes;
>  	}
> -	return zl;
> +	return nid;
>  }
>  
>  /*
> @@ -1924,12 +1923,10 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
>  static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
>  					unsigned nid)
>  {
> -	struct zonelist *zl;
>  	struct page *page;
>  
> -	zl = node_zonelist(nid, gfp);
> -	page = __alloc_pages(gfp, order, zl);
> -	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
> +	page = __alloc_pages(gfp, order, nid);
> +	if (page && page_to_nid(page) == nid)
>  		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
>  	return page;
>  }
> @@ -1963,8 +1960,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  {
>  	struct mempolicy *pol;
>  	struct page *page;
> +	int preferred_nid;
>  	unsigned int cpuset_mems_cookie;
> -	struct zonelist *zl;
>  	nodemask_t *nmask;
>  
>  retry_cpuset:
> @@ -2007,8 +2004,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  	}
>  
>  	nmask = policy_nodemask(gfp, pol);
> -	zl = policy_zonelist(gfp, pol, node);
> -	page = __alloc_pages_nodemask(gfp, order, zl, nmask);
> +	preferred_nid = policy_node(gfp, pol, node);
> +	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
>  	mpol_cond_put(pol);
>  out:
>  	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> @@ -2055,7 +2052,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
>  		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
>  	else
>  		page = __alloc_pages_nodemask(gfp, order,
> -				policy_zonelist(gfp, pol, numa_node_id()),
> +				policy_node(gfp, pol, numa_node_id()),
>  				policy_nodemask(gfp, pol));
>  
>  	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 43aa767c3188..0aceca1076dc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3962,12 +3962,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  }
>  
>  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> -		struct zonelist *zonelist, nodemask_t *nodemask,
> +		int preferred_nid, nodemask_t *nodemask,
>  		struct alloc_context *ac, gfp_t *alloc_mask,
>  		unsigned int *alloc_flags)
>  {
>  	ac->high_zoneidx = gfp_zone(gfp_mask);
> -	ac->zonelist = zonelist;
> +	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
>  	ac->nodemask = nodemask;
>  	ac->migratetype = gfpflags_to_migratetype(gfp_mask);
>  
> @@ -4012,8 +4012,8 @@ static inline void finalise_ac(gfp_t gfp_mask,
>   * This is the 'heart' of the zoned buddy allocator.
>   */
>  struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> -			struct zonelist *zonelist, nodemask_t *nodemask)
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> +							nodemask_t *nodemask)
>  {
>  	struct page *page;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> @@ -4021,7 +4021,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  	struct alloc_context ac = { };
>  
>  	gfp_mask &= gfp_allowed_mask;
> -	if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask, &ac, &alloc_mask, &alloc_flags))
> +	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
>  		return NULL;
>  
>  	finalise_ac(gfp_mask, order, &ac);
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets
  2017-05-17  8:11 ` [PATCH v2 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets Vlastimil Babka
@ 2017-05-23  7:11   ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-05-23  7:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov

On Wed 17-05-17 10:11:38, Vlastimil Babka wrote:
> Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
> cpuset's mems") has introduced a two-step protocol when rebinding task's
> mempolicy due to cpuset update, in order to avoid a parallel allocation seeing
> an empty effective nodemask and failing. Later, commit cc9a6c877661 ("cpuset:
> mm: reduce large amounts of memory barrier related damage v3") introduced
> a seqlock protection and removed the synchronization point between the two
> update steps. At that point (or perhaps later), the two-step rebinding became
> unnecessary. Currently it only makes sure that the update first adds new nodes
> in step 1 and then removes nodes in step 2. Without memory barriers the effects
> are questionable, and even then this cannot prevent a parallel zonelist
> iteration, which checks the nodemask at each step, from observing all nodes as unusable
> for allocation. We now fully rely on the seqlock to prevent premature OOMs and
> allocation failures.
> 
> We can thus remove the two-step update parts and simplify the code.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

\o/ This code has just been piling up more complicated code on top of code
which shouldn't have existed in the first place, so I am very happy to see
it go. I hope we can go without rebinding altogether long term. Changing
data under our feet just asks for problems, and this is a nice example of
where that leads.

Acked-by: Michal Hocko <mhocko@suse.com>
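
As a side note on what a rebind actually does in the default case (no
MPOL_F_STATIC_NODES, no MPOL_F_RELATIVE_NODES): the policy's nodes are
remapped positionally from the old cpuset mask to the new one, as in the
nodes_remap() call kept by this patch. A small standalone sketch of that
positional remap on plain bitmasks (an editorial illustration of the simple
equal-weight case only, not the kernel's bitmap_remap() with all its edge
cases; assumes a 64-bit unsigned long):

#include <stdio.h>

static unsigned long remap(unsigned long policy, unsigned long oldmask,
			   unsigned long newmask)
{
	unsigned long out = 0;
	int oldbit, newbit = -1;

	for (oldbit = 0; oldbit < 64; oldbit++) {
		if (!(oldmask & (1UL << oldbit)))
			continue;
		/* the n-th old node maps to the n-th new node */
		do {
			newbit++;
		} while (newbit < 64 && !(newmask & (1UL << newbit)));
		if (newbit >= 64)
			break;
		if (policy & (1UL << oldbit))
			out |= 1UL << newbit;
	}
	return out;
}

int main(void)
{
	/* cpuset moves from {0,1} to {2,3}: MPOL_BIND to {1} becomes {3} */
	printf("%#lx\n", remap(0x2UL, 0x3UL, 0xcUL));	/* prints 0x8 */
	return 0;
}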

> ---
>  include/linux/mempolicy.h      |   6 +--
>  include/uapi/linux/mempolicy.h |   8 ----
>  kernel/cgroup/cpuset.c         |   4 +-
>  mm/mempolicy.c                 | 102 ++++++++---------------------------------
>  4 files changed, 21 insertions(+), 99 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ecb6cbeede5a..3a58b4be1b0c 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -142,8 +142,7 @@ bool vma_policy_mof(struct vm_area_struct *vma);
>  
>  extern void numa_default_policy(void);
>  extern void numa_policy_init(void);
> -extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
> -				enum mpol_rebind_step step);
> +extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new);
>  extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
>  
>  extern int huge_node(struct vm_area_struct *vma,
> @@ -260,8 +259,7 @@ static inline void numa_default_policy(void)
>  }
>  
>  static inline void mpol_rebind_task(struct task_struct *tsk,
> -				const nodemask_t *new,
> -				enum mpol_rebind_step step)
> +				const nodemask_t *new)
>  {
>  }
>  
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 9cd8b21dddbe..2a4d89508fec 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -24,13 +24,6 @@ enum {
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> -enum mpol_rebind_step {
> -	MPOL_REBIND_ONCE,	/* do rebind work at once(not by two step) */
> -	MPOL_REBIND_STEP1,	/* first step(set all the newly nodes) */
> -	MPOL_REBIND_STEP2,	/* second step(clean all the disallowed nodes)*/
> -	MPOL_REBIND_NSTEP,
> -};
> -
>  /* Flags for set_mempolicy */
>  #define MPOL_F_STATIC_NODES	(1 << 15)
>  #define MPOL_F_RELATIVE_NODES	(1 << 14)
> @@ -65,7 +58,6 @@ enum mpol_rebind_step {
>   */
>  #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
>  #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
> -#define MPOL_F_REBINDING (1 << 2)	/* identify policies in rebinding */
>  #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
>  #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
>  
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 0f41292be0fb..dfd5b420452d 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1063,9 +1063,7 @@ static void cpuset_change_task_nodemask(struct task_struct *tsk,
>  	}
>  
>  	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
> -	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);
> -
> -	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);
> +	mpol_rebind_task(tsk, newmems);
>  	tsk->mems_allowed = *newmems;
>  
>  	if (need_loop) {
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index c60807625fd5..047181452040 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -146,22 +146,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
>  
>  static const struct mempolicy_operations {
>  	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
> -	/*
> -	 * If read-side task has no lock to protect task->mempolicy, write-side
> -	 * task will rebind the task->mempolicy by two step. The first step is
> -	 * setting all the newly nodes, and the second step is cleaning all the
> -	 * disallowed nodes. In this way, we can avoid finding no node to alloc
> -	 * page.
> -	 * If we have a lock to protect task->mempolicy in read-side, we do
> -	 * rebind directly.
> -	 *
> -	 * step:
> -	 * 	MPOL_REBIND_ONCE - do rebind work at once
> -	 * 	MPOL_REBIND_STEP1 - set all the newly nodes
> -	 * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
> -	 */
> -	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes,
> -			enum mpol_rebind_step step);
> +	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
>  } mpol_ops[MPOL_MAX];
>  
>  static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> @@ -304,19 +289,11 @@ void __mpol_put(struct mempolicy *p)
>  	kmem_cache_free(policy_cache, p);
>  }
>  
> -static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes,
> -				enum mpol_rebind_step step)
> +static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
>  {
>  }
>  
> -/*
> - * step:
> - * 	MPOL_REBIND_ONCE  - do rebind work at once
> - * 	MPOL_REBIND_STEP1 - set all the newly nodes
> - * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
> - */
> -static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
> -				 enum mpol_rebind_step step)
> +static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>  {
>  	nodemask_t tmp;
>  
> @@ -325,35 +302,19 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes,
>  	else if (pol->flags & MPOL_F_RELATIVE_NODES)
>  		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
>  	else {
> -		/*
> -		 * if step == 1, we use ->w.cpuset_mems_allowed to cache the
> -		 * result
> -		 */
> -		if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP1) {
> -			nodes_remap(tmp, pol->v.nodes,
> -					pol->w.cpuset_mems_allowed, *nodes);
> -			pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
> -		} else if (step == MPOL_REBIND_STEP2) {
> -			tmp = pol->w.cpuset_mems_allowed;
> -			pol->w.cpuset_mems_allowed = *nodes;
> -		} else
> -			BUG();
> +		nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
> +								*nodes);
> +		pol->w.cpuset_mems_allowed = tmp;
>  	}
>  
>  	if (nodes_empty(tmp))
>  		tmp = *nodes;
>  
> -	if (step == MPOL_REBIND_STEP1)
> -		nodes_or(pol->v.nodes, pol->v.nodes, tmp);
> -	else if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP2)
> -		pol->v.nodes = tmp;
> -	else
> -		BUG();
> +	pol->v.nodes = tmp;
>  }
>  
>  static void mpol_rebind_preferred(struct mempolicy *pol,
> -				  const nodemask_t *nodes,
> -				  enum mpol_rebind_step step)
> +						const nodemask_t *nodes)
>  {
>  	nodemask_t tmp;
>  
> @@ -379,42 +340,19 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
>  /*
>   * mpol_rebind_policy - Migrate a policy to a different set of nodes
>   *
> - * If read-side task has no lock to protect task->mempolicy, write-side
> - * task will rebind the task->mempolicy by two step. The first step is
> - * setting all the newly nodes, and the second step is cleaning all the
> - * disallowed nodes. In this way, we can avoid finding no node to alloc
> - * page.
> - * If we have a lock to protect task->mempolicy in read-side, we do
> - * rebind directly.
> - *
> - * step:
> - * 	MPOL_REBIND_ONCE  - do rebind work at once
> - * 	MPOL_REBIND_STEP1 - set all the newly nodes
> - * 	MPOL_REBIND_STEP2 - clean all the disallowed nodes
> + * Per-vma policies are protected by mmap_sem. Allocations using per-task
> + * policies are protected by task->mems_allowed_seq to prevent a premature
> + * OOM/allocation failure due to parallel nodemask modification.
>   */
> -static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
> -				enum mpol_rebind_step step)
> +static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
>  {
>  	if (!pol)
>  		return;
> -	if (!mpol_store_user_nodemask(pol) && step == MPOL_REBIND_ONCE &&
> +	if (!mpol_store_user_nodemask(pol) &&
>  	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
>  		return;
>  
> -	if (step == MPOL_REBIND_STEP1 && (pol->flags & MPOL_F_REBINDING))
> -		return;
> -
> -	if (step == MPOL_REBIND_STEP2 && !(pol->flags & MPOL_F_REBINDING))
> -		BUG();
> -
> -	if (step == MPOL_REBIND_STEP1)
> -		pol->flags |= MPOL_F_REBINDING;
> -	else if (step == MPOL_REBIND_STEP2)
> -		pol->flags &= ~MPOL_F_REBINDING;
> -	else if (step >= MPOL_REBIND_NSTEP)
> -		BUG();
> -
> -	mpol_ops[pol->mode].rebind(pol, newmask, step);
> +	mpol_ops[pol->mode].rebind(pol, newmask);
>  }
>  
>  /*
> @@ -424,10 +362,9 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask,
>   * Called with task's alloc_lock held.
>   */
>  
> -void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
> -			enum mpol_rebind_step step)
> +void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
>  {
> -	mpol_rebind_policy(tsk->mempolicy, new, step);
> +	mpol_rebind_policy(tsk->mempolicy, new);
>  }
>  
>  /*
> @@ -442,7 +379,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
>  
>  	down_write(&mm->mmap_sem);
>  	for (vma = mm->mmap; vma; vma = vma->vm_next)
> -		mpol_rebind_policy(vma->vm_policy, new, MPOL_REBIND_ONCE);
> +		mpol_rebind_policy(vma->vm_policy, new);
>  	up_write(&mm->mmap_sem);
>  }
>  
> @@ -2101,10 +2038,7 @@ struct mempolicy *__mpol_dup(struct mempolicy *old)
>  
>  	if (current_cpuset_is_being_rebound()) {
>  		nodemask_t mems = cpuset_mems_allowed(current);
> -		if (new->flags & MPOL_F_REBINDING)
> -			mpol_rebind_policy(new, &mems, MPOL_REBIND_STEP2);
> -		else
> -			mpol_rebind_policy(new, &mems, MPOL_REBIND_ONCE);
> +		mpol_rebind_policy(new, &mems);
>  	}
>  	atomic_set(&new->refcnt, 1);
>  	return new;
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 5/6] mm, cpuset: always use seqlock when changing task's nodemask
  2017-05-17  8:11 ` [PATCH v2 5/6] mm, cpuset: always use seqlock when changing task's nodemask Vlastimil Babka
@ 2017-05-23  7:15   ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-05-23  7:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov

On Wed 17-05-17 10:11:39, Vlastimil Babka wrote:
> When updating a task's mems_allowed and rebinding its mempolicy due to cpuset's
> mems being changed, we currently only take the seqlock for writing when either
> the task has a mempolicy, or the new mems has no intersection with the old
> mems. This should be enough to prevent a parallel allocation seeing no
> available nodes, but the optimization is IMHO unnecessary (cpuset updates
> should not be frequent), and we still potentially risk issues if the
> intersection of new and old nodes has a limited amount of free/reclaimable
> memory. Let's just use the seqlock for all tasks.

Agreed

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>
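
For reference, the pairing this change relies on, condensed from the hunks
in this series (schematic, not a standalone unit): the writer publishes the
new mems and the rebound mempolicy under the seqcount, and any allocation
that raced with the update notices the bumped sequence and retries.

/* writer side: cpuset_change_task_nodemask() after this patch */
local_irq_disable();
write_seqcount_begin(&tsk->mems_allowed_seq);
nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
mpol_rebind_task(tsk, newmems);
tsk->mems_allowed = *newmems;
write_seqcount_end(&tsk->mems_allowed_seq);
local_irq_enable();

/* reader side, e.g. the hugetlb path quoted earlier in the thread */
do {
	cpuset_mems_cookie = read_mems_allowed_begin();
	/* pick nid/nodemask from the mempolicy and try to allocate */
	page = __alloc_pages_nodemask(gfp, order, nid, nodemask);
} while (!page && read_mems_allowed_retry(cpuset_mems_cookie));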

> ---
>  kernel/cgroup/cpuset.c | 29 ++++++++---------------------
>  1 file changed, 8 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index dfd5b420452d..26a1c360a481 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1038,38 +1038,25 @@ static void cpuset_post_attach(void)
>   * @tsk: the task to change
>   * @newmems: new nodes that the task will be set
>   *
> - * In order to avoid seeing no nodes if the old and new nodes are disjoint,
> - * we structure updates as setting all new allowed nodes, then clearing newly
> - * disallowed ones.
> + * We use the mems_allowed_seq seqlock to safely update both tsk->mems_allowed
> + * and rebind an eventual tasks' mempolicy. If the task is allocating in
> + * parallel, it might temporarily see an empty intersection, which results in
> + * a seqlock check and retry before OOM or allocation failure.
>   */
>  static void cpuset_change_task_nodemask(struct task_struct *tsk,
>  					nodemask_t *newmems)
>  {
> -	bool need_loop;
> -
>  	task_lock(tsk);
> -	/*
> -	 * Determine if a loop is necessary if another thread is doing
> -	 * read_mems_allowed_begin().  If at least one node remains unchanged and
> -	 * tsk does not have a mempolicy, then an empty nodemask will not be
> -	 * possible when mems_allowed is larger than a word.
> -	 */
> -	need_loop = task_has_mempolicy(tsk) ||
> -			!nodes_intersects(*newmems, tsk->mems_allowed);
>  
> -	if (need_loop) {
> -		local_irq_disable();
> -		write_seqcount_begin(&tsk->mems_allowed_seq);
> -	}
> +	local_irq_disable();
> +	write_seqcount_begin(&tsk->mems_allowed_seq);
>  
>  	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
>  	mpol_rebind_task(tsk, newmems);
>  	tsk->mems_allowed = *newmems;
>  
> -	if (need_loop) {
> -		write_seqcount_end(&tsk->mems_allowed_seq);
> -		local_irq_enable();
> -	}
> +	write_seqcount_end(&tsk->mems_allowed_seq);
> +	local_irq_enable();
>  
>  	task_unlock(tsk);
>  }
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter
  2017-05-17  8:11 ` [PATCH v2 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter Vlastimil Babka
@ 2017-05-23  7:16   ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2017-05-23  7:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov

On Wed 17-05-17 10:11:40, Vlastimil Babka wrote:
> Two wrappers of __alloc_pages_nodemask() are checking task->mems_allowed_seq
> themselves to retry allocation that has raced with a cpuset update. This has
> been shown to be ineffective in preventing premature OOM's which can happen in
> __alloc_pages_slowpath() long before it returns back to the wrappers to detect
> the race at that level. Previous patches have made __alloc_pages_slowpath()
> more robust, so we can now simply remove the seqlock checking in the wrappers
> to prevent further wrong impression that it can actually help.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>
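
A rough timeline of why the wrapper-level cookie could not help (an
editorial condensation of the reasoning in patch 1 and in the paragraph
above; function names as in the quoted code):

/*
 * alloc_pages_vma()                        cpuset update
 * --------------------------------------   -----------------------------
 * cookie = read_mems_allowed_begin()
 * nmask = policy_nodemask(gfp, pol)        (passes the intersection check)
 *                                          mems_allowed changes, seq bumped
 * __alloc_pages_nodemask()
 *   __alloc_pages_slowpath()
 *     takes its own cookie (post-update)
 *     reclaim fails, nmask has no usable nodes left
 *     read_mems_allowed_retry() == false   -> premature OOM kill here
 * read_mems_allowed_retry(cookie) in the wrapper would fire, but only
 * after the damage is done
 */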

> ---
>  mm/mempolicy.c | 16 ----------------
>  1 file changed, 16 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 047181452040..7d8e56214ac0 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1898,12 +1898,9 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  	struct mempolicy *pol;
>  	struct page *page;
>  	int preferred_nid;
> -	unsigned int cpuset_mems_cookie;
>  	nodemask_t *nmask;
>  
> -retry_cpuset:
>  	pol = get_vma_policy(vma, addr);
> -	cpuset_mems_cookie = read_mems_allowed_begin();
>  
>  	if (pol->mode == MPOL_INTERLEAVE) {
>  		unsigned nid;
> @@ -1945,8 +1942,6 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
>  	mpol_cond_put(pol);
>  out:
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> -		goto retry_cpuset;
>  	return page;
>  }
>  
> @@ -1964,23 +1959,15 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>   *	Allocate a page from the kernel page pool.  When not in
>   *	interrupt context and apply the current process NUMA policy.
>   *	Returns NULL when no page can be allocated.
> - *
> - *	Don't call cpuset_update_task_memory_state() unless
> - *	1) it's ok to take cpuset_sem (can WAIT), and
> - *	2) allocating for current task (not interrupt).
>   */
>  struct page *alloc_pages_current(gfp_t gfp, unsigned order)
>  {
>  	struct mempolicy *pol = &default_policy;
>  	struct page *page;
> -	unsigned int cpuset_mems_cookie;
>  
>  	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
>  		pol = get_task_policy(current);
>  
> -retry_cpuset:
> -	cpuset_mems_cookie = read_mems_allowed_begin();
> -
>  	/*
>  	 * No reference counting needed for current->mempolicy
>  	 * nor system default_policy
> @@ -1992,9 +1979,6 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
>  				policy_node(gfp, pol, numa_node_id()),
>  				policy_nodemask(gfp, pol));
>  
> -	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
> -		goto retry_cpuset;
> -
>  	return page;
>  }
>  EXPORT_SYMBOL(alloc_pages_current);
> -- 
> 2.12.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update
  2017-05-19 11:51   ` Michal Hocko
@ 2017-05-23 11:32     ` Vlastimil Babka
  0 siblings, 0 replies; 17+ messages in thread
From: Vlastimil Babka @ 2017-05-23 11:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-api, linux-kernel, cgroups,
	Li Zefan, Mel Gorman, David Rientjes, Christoph Lameter,
	Hugh Dickins, Andrea Arcangeli, Anshuman Khandual,
	Kirill A. Shutemov

On 05/19/2017 01:51 PM, Michal Hocko wrote:
> On Wed 17-05-17 10:11:35, Vlastimil Babka wrote:
>> Commit e47483bca2cc ("mm, page_alloc: fix premature OOM when racing with cpuset
>> mems update") has fixed known recent regressions found by LTP's cpuset01
>> testcase. I have however found that by modifying the testcase to use per-vma
>> mempolicies via mbind(2) instead of per-task mempolicies via set_mempolicy(2),
>> the premature OOM still happens and the issue is much older.
>>
>> The root of the problem is that the cpuset's mems_allowed and mempolicy's
>> nodemask can temporarily have no intersection, thus get_page_from_freelist()
>> cannot find any usable zone. The current semantic for empty intersection is to
>> ignore mempolicy's nodemask and honour cpuset restrictions. This is checked in
>> policy_nodemask(), but the racy update can happen after we already passed the
>> check. Such races should be protected by the seqlock task->mems_allowed_seq,
>> but it doesn't work here, because 1) mpol_rebind_mm() does not happen under
>> seqlock for write, and doing so would lead to deadlock, as it takes mmap_sem
>> for write, while the allocation can have mmap_sem for read when it's taking the
>> seqlock for read. And 2) the seqlock cookie of callers of node_zonelist()
>> (alloc_pages_vma() and alloc_pages_current()) is different than the one of
>> __alloc_pages_slowpath(), so there's still a potential race window.
>>
>> This patch fixes the issue by having __alloc_pages_slowpath() check for empty
>> intersection of cpuset and ac->nodemask before OOM or allocation failure. If
>> it's indeed empty, the nodemask is ignored and allocation retried, which mimics
>> policy_nodemask(). This works fine, because almost all callers of
>> __alloc_pages_nodemask obtain the nodemask via policy_nodemask(). The only
>> exception is new_node_page() from hotplug, where the potential violation of
>> nodemask isn't an issue, as there's already a fallback allocation attempt
>> without any nodemask. If there's a future caller that needs to have its specific
>> nodemask honoured over task's cpuset restrictions, we'll have to e.g. add a gfp
>> flag for that.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Do we want this backported to the stable tree?

I'm not aware of any external report, and the problem has been there for a
long time.

> OK I do agree this makes some sense as a quick and easy to backport
> workaround.

It might not be that straightforward: the __alloc_pages* stuff has been
through a lot of changes recently, and e.g. the handling of
cpuset_mems_cookie has moved to __alloc_pages_slowpath() in the last
version or two.

So I'm not very enthusiastic about stable here.

> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> 
>> ---
>>  mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 38 insertions(+), 13 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index beb2827fd5de..43aa767c3188 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3661,6 +3661,39 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>>  	return false;
>>  }
>>  
>> +static inline bool
>> +check_retry_cpuset(int cpuset_mems_cookie, struct alloc_context *ac)
>> +{
>> +	/*
>> +	 * It's possible that cpuset's mems_allowed and the nodemask from
>> +	 * mempolicy don't intersect. This should be normally dealt with by
>> +	 * policy_nodemask(), but it's possible to race with cpuset update in
>> +	 * such a way the check therein was true, and then it became false
>> +	 * before we got our cpuset_mems_cookie here.
>> +	 * This assumes that for all allocations, ac->nodemask can come only
>> +	 * from MPOL_BIND mempolicy (whose documented semantics is to be ignored
>> +	 * when it does not intersect with the cpuset restrictions) or the
>> +	 * caller can deal with a violated nodemask.
>> +	 */
>> +	if (cpusets_enabled() && ac->nodemask &&
>> +			!cpuset_nodemask_valid_mems_allowed(ac->nodemask)) {
>> +		ac->nodemask = NULL;
>> +		return true;
>> +	}
>> +
>> +	/*
>> +	 * When updating a task's mems_allowed or mempolicy nodemask, it is
>> +	 * possible to race with parallel threads in such a way that our
>> +	 * allocation can fail while the mask is being updated. If we are about
>> +	 * to fail, check if the cpuset changed during allocation and if so,
>> +	 * retry.
>> +	 */
>> +	if (read_mems_allowed_retry(cpuset_mems_cookie))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  static inline struct page *
>>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  						struct alloc_context *ac)
>> @@ -3856,11 +3889,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  				&compaction_retries))
>>  		goto retry;
>>  
>> -	/*
>> -	 * It's possible we raced with cpuset update so the OOM would be
>> -	 * premature (see below the nopage: label for full explanation).
>> -	 */
>> -	if (read_mems_allowed_retry(cpuset_mems_cookie))
>> +
>> +	/* Deal with possible cpuset update races before we start OOM killing */
>> +	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>>  		goto retry_cpuset;
>>  
>>  	/* Reclaim has failed us, start killing things */
>> @@ -3879,14 +3910,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  	}
>>  
>>  nopage:
>> -	/*
>> -	 * When updating a task's mems_allowed or mempolicy nodemask, it is
>> -	 * possible to race with parallel threads in such a way that our
>> -	 * allocation can fail while the mask is being updated. If we are about
>> -	 * to fail, check if the cpuset changed during allocation and if so,
>> -	 * retry.
>> -	 */
>> -	if (read_mems_allowed_retry(cpuset_mems_cookie))
>> +	/* Deal with possible cpuset update races before we fail */
>> +	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>>  		goto retry_cpuset;
>>  
>>  	/*
>> -- 
>> 2.12.2
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-05-23 11:32 UTC | newest]

Thread overview: 17+ messages
2017-05-17  8:11 [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Vlastimil Babka
2017-05-17  8:11 ` [PATCH v2 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update Vlastimil Babka
2017-05-19 11:51   ` Michal Hocko
2017-05-23 11:32     ` Vlastimil Babka
2017-05-17  8:11 ` [PATCH v2 2/6] mm, mempolicy: stop adjusting current->il_next in mpol_rebind_nodemask() Vlastimil Babka
2017-05-17 15:07   ` Christoph Lameter
2017-05-17  8:11 ` [PATCH v2 3/6] mm, page_alloc: pass preferred nid instead of zonelist to allocator Vlastimil Babka
2017-05-17 15:19   ` Christoph Lameter
2017-05-18 10:25     ` Vlastimil Babka
2017-05-19 11:59   ` Michal Hocko
2017-05-17  8:11 ` [PATCH v2 4/6] mm, mempolicy: simplify rebinding mempolicies when updating cpusets Vlastimil Babka
2017-05-23  7:11   ` Michal Hocko
2017-05-17  8:11 ` [PATCH v2 5/6] mm, cpuset: always use seqlock when changing task's nodemask Vlastimil Babka
2017-05-23  7:15   ` Michal Hocko
2017-05-17  8:11 ` [PATCH v2 6/6] mm, mempolicy: don't check cpuset seqlock where it doesn't matter Vlastimil Babka
2017-05-23  7:16   ` Michal Hocko
2017-05-17 16:24 ` [PATCH v2 0/6] cpuset/mempolicies related fixes and cleanups Michal Hocko
