* [v3 PATCH 0/3] mm/mempolicy: some fix and semantics cleanup
From: Feng Tang @ 2021-05-31 14:05 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

Hi All,

We've posted a v4 patchset introducing a new "prefer-many" memory policy
https://lore.kernel.org/lkml/1615952410-36895-1-git-send-email-feng.tang@intel.com/ ,
for which Michal Hocko gave many comments and pointed out some
problems. We also found some semantic confusion between the 'prefer'
and 'local' policies, as well as some duplicated code. This patchset
tries to address these issues. Please help to review, thanks!

The patchset has been run through sanity tests such as 'stress-ng'
and 'ltp', and no problems were found.

Thanks,
Feng

Changelogs:
    v3:
      * fix logic of mpol_rebind_preferred() (Michal Hocko)

    v2:
      * rename mempolicy_nodemask_intersects() to
        mempolicy_in_oom_domain() and correct commit log (Michal Hocko)
      * change the mpol syscall param sanity check (Michal Hocko) 
      * combine the 3/4 and 4/4 in v1 into one patch,
        and further clean up the logic (Michal Hocko)

    v1:
      * use helper func instead of macro for patch 2/4 (David Rientjes)
      * fix a possible null pointer case in patch 3/4 		
      * update commit log for 1/4
      
    RFC v2:
      * add the oom check fix as patch 1/4
      * add the unification patch for mpol preprocessing as patch 2/4

Feng Tang (3):
  mm/mempolicy: cleanup nodemask intersection check for oom
  mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED
    policy
  mm/mempolicy: unify the parameter sanity check for mbind and
    set_mempolicy

 include/linux/mempolicy.h      |   2 +-
 include/uapi/linux/mempolicy.h |   1 -
 mm/mempolicy.c                 | 212 ++++++++++++++++++-----------------------
 mm/oom_kill.c                  |   2 +-
 4 files changed, 95 insertions(+), 122 deletions(-)

-- 
2.7.4



* [v3 PATCH 1/3] mm/mempolicy: cleanup nodemask intersection check for oom
From: Feng Tang @ 2021-05-31 14:05 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

mempolicy_nodemask_intersects() is used in the OOM case to check
whether a task may have memory allocated on some memory nodes.

As it is only used by the OOM check, rename it to
mempolicy_in_oom_domain() to reduce confusion.

Since only the 'bind' policy makes the nodemask a hard requirement
for where memory may be allocated from, do the intersection check
only for that policy, and return true for all other policies.
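
To illustrate the resulting semantics (a hedged sketch using kernel
nodemask helpers; illustrative only, not part of the patch): a task
hard-bound to node 1 cannot have allocated from node 0, so a
node-0-constrained OOM may skip it, while a task with any other
policy may have fallen back to node 0:

        /* 'bind' is the only hard placement constraint */
        nodemask_t bound = nodemask_of_node(1);    /* task's MPOL_BIND nodes */
        nodemask_t oom_mask = nodemask_of_node(0); /* nodes under OOM */

        if (!nodes_intersects(bound, oom_mask))
                pr_info("task is outside the OOM domain, skip it\n");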

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/mempolicy.h |  2 +-
 mm/mempolicy.c            | 34 +++++++++-------------------------
 mm/oom_kill.c             |  2 +-
 3 files changed, 11 insertions(+), 27 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f1c74d..8773c55 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -150,7 +150,7 @@ extern int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
 extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
-extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
 				const nodemask_t *mask);
 extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d79fa29..6795a6a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2094,16 +2094,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 #endif
 
 /*
- * mempolicy_nodemask_intersects
+ * mempolicy_in_oom_domain
  *
- * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
- * policy.  Otherwise, check for intersection between mask and the policy
- * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
- * policy, always return true since it may allocate elsewhere on fallback.
+ * If tsk's mempolicy is "bind", check for intersection between mask and
+ * the policy nodemask. Otherwise, return true for all other policies
+ * including "interleave", as a tsk with "interleave" policy may have
+ * memory allocated from all nodes in system.
  *
  * Takes task_lock(tsk) to prevent freeing of its mempolicy.
  */
-bool mempolicy_nodemask_intersects(struct task_struct *tsk,
+bool mempolicy_in_oom_domain(struct task_struct *tsk,
 					const nodemask_t *mask)
 {
 	struct mempolicy *mempolicy;
@@ -2111,29 +2111,13 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 
 	if (!mask)
 		return ret;
+
 	task_lock(tsk);
 	mempolicy = tsk->mempolicy;
-	if (!mempolicy)
-		goto out;
-
-	switch (mempolicy->mode) {
-	case MPOL_PREFERRED:
-		/*
-		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
-		 * allocate from, they may fallback to other nodes when oom.
-		 * Thus, it's possible for tsk to have allocated memory from
-		 * nodes in mask.
-		 */
-		break;
-	case MPOL_BIND:
-	case MPOL_INTERLEAVE:
+	if (mempolicy && mempolicy->mode == MPOL_BIND)
 		ret = nodes_intersects(mempolicy->v.nodes, *mask);
-		break;
-	default:
-		BUG();
-	}
-out:
 	task_unlock(tsk);
+
 	return ret;
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eefd3f5..fcc29e9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -104,7 +104,7 @@ static bool oom_cpuset_eligible(struct task_struct *start,
 			 * mempolicy intersects current, otherwise it may be
 			 * needlessly killed.
 			 */
-			ret = mempolicy_nodemask_intersects(tsk, mask);
+			ret = mempolicy_in_oom_domain(tsk, mask);
 		} else {
 			/*
 			 * This is not a mempolicy constrained oom, so only
-- 
2.7.4



* [v3 PATCH 2/3] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
From: Feng Tang @ 2021-05-31 14:05 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

The MPOL_LOCAL policy has been set up as a real policy, but it is
still handled like a fake MPOL_PREFERRED policy with the internal
MPOL_F_LOCAL flag bit set, and many places have to distinguish the
real 'prefer' policy from the 'local' one, which is quite confusing.

In the current code, MPOL_LOCAL is used in 4 cases:
1. the user specifies the 'local' policy
2. the user specifies the 'prefer' policy, but with an empty nodemask
3. the system 'default' policy is used
4. the 'prefer' policy with a valid 'preferred' node and the
   MPOL_F_STATIC_NODES flag set: when it is rebound to a nodemask
   which doesn't contain the 'preferred' node, it behaves as the
   'local' policy

So make 'local' a real policy instead of a fake 'prefer' one, and
kill the MPOL_F_LOCAL bit, which greatly reduces the confusion when
reading the code.
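
As a hedged illustration of the first two cases (a minimal userspace
sketch assuming the set_mempolicy() wrapper from libnuma's
<numaif.h>; not part of the patch), both calls below end up with
local-allocation semantics:

        #include <numaif.h>

        int main(void)
        {
                /* case 1: explicitly ask for local allocation */
                set_mempolicy(MPOL_LOCAL, NULL, 0);

                /* case 2: 'prefer' with an empty nodemask also means local */
                set_mempolicy(MPOL_PREFERRED, NULL, 0);
                return 0;
        }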

For case 4, the logic of mpol_rebind_preferred() is confusing, as
Michal Hocko pointed out:

 "
 I do believe that rebinding preferred policy is just bogus and
 it should be dropped altogether on the ground that a preference
 is a mere hint from userspace where to start the allocation.
 Unless I am missing something cpusets will be always authoritative
 for the final placement. The preferred node just acts as a starting
 point and it should be really preserved when cpusets changes.
 Otherwise we have very subtle behavior corner cases.
 "
So drop all the tricky transformations between 'prefer' and 'local',
and just record the new nodemask from the rebinding.

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/uapi/linux/mempolicy.h |   1 -
 mm/mempolicy.c                 | 131 +++++++++++++++++------------------------
 2 files changed, 55 insertions(+), 77 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 4832fd0..19a00bc 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -60,7 +60,6 @@ enum {
  * are never OR'ed into the mode in mempolicy API arguments.
  */
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
-#define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6795a6a..c337bd7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -121,8 +121,7 @@ enum zone_type policy_zone = 0;
  */
 static struct mempolicy default_policy = {
 	.refcnt = ATOMIC_INIT(1), /* never free it */
-	.mode = MPOL_PREFERRED,
-	.flags = MPOL_F_LOCAL,
+	.mode = MPOL_LOCAL,
 };
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
@@ -200,12 +199,9 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
 
 static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 {
-	if (!nodes)
-		pol->flags |= MPOL_F_LOCAL;	/* local allocation */
-	else if (nodes_empty(*nodes))
-		return -EINVAL;			/*  no allowed nodes */
-	else
-		pol->v.preferred_node = first_node(*nodes);
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->v.preferred_node = first_node(*nodes);
 	return 0;
 }
 
@@ -217,6 +213,11 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_local(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	return 0;
+}
+
 /*
  * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
  * any, for the new policy.  mpol_new() has already validated the nodes
@@ -239,25 +240,19 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
 	VM_BUG_ON(!nodes);
-	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
-		nodes = NULL;	/* explicit local allocation */
-	else {
-		if (pol->flags & MPOL_F_RELATIVE_NODES)
-			mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
-		else
-			nodes_and(nsc->mask2, *nodes, nsc->mask1);
 
-		if (mpol_store_user_nodemask(pol))
-			pol->w.user_nodemask = *nodes;
-		else
-			pol->w.cpuset_mems_allowed =
-						cpuset_current_mems_allowed;
-	}
+	if (pol->flags & MPOL_F_RELATIVE_NODES)
+		mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
+	else
+		nodes_and(nsc->mask2, *nodes, nsc->mask1);
 
-	if (nodes)
-		ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
+	if (mpol_store_user_nodemask(pol))
+		pol->w.user_nodemask = *nodes;
 	else
-		ret = mpol_ops[pol->mode].create(pol, NULL);
+		pol->w.cpuset_mems_allowed =
+					cpuset_current_mems_allowed;
+
+	ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
 	return ret;
 }
 
@@ -290,13 +285,14 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			if (((flags & MPOL_F_STATIC_NODES) ||
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
+
+			mode = MPOL_LOCAL;
 		}
 	} else if (mode == MPOL_LOCAL) {
 		if (!nodes_empty(*nodes) ||
 		    (flags & MPOL_F_STATIC_NODES) ||
 		    (flags & MPOL_F_RELATIVE_NODES))
 			return ERR_PTR(-EINVAL);
-		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -344,25 +340,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 static void mpol_rebind_preferred(struct mempolicy *pol,
 						const nodemask_t *nodes)
 {
-	nodemask_t tmp;
-
-	if (pol->flags & MPOL_F_STATIC_NODES) {
-		int node = first_node(pol->w.user_nodemask);
-
-		if (node_isset(node, *nodes)) {
-			pol->v.preferred_node = node;
-			pol->flags &= ~MPOL_F_LOCAL;
-		} else
-			pol->flags |= MPOL_F_LOCAL;
-	} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
-		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
-		pol->v.preferred_node = first_node(tmp);
-	} else if (!(pol->flags & MPOL_F_LOCAL)) {
-		pol->v.preferred_node = node_remap(pol->v.preferred_node,
-						   pol->w.cpuset_mems_allowed,
-						   *nodes);
-		pol->w.cpuset_mems_allowed = *nodes;
-	}
+	pol->w.cpuset_mems_allowed = *nodes;
 }
 
 /*
@@ -376,7 +354,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
 {
 	if (!pol)
 		return;
-	if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
+	if (!mpol_store_user_nodemask(pol) &&
 	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
 		return;
 
@@ -427,6 +405,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_bind,
 		.rebind = mpol_rebind_nodemask,
 	},
+	[MPOL_LOCAL] = {
+		.create = mpol_new_local,
+		.rebind = mpol_rebind_default,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -919,10 +901,12 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
+	case MPOL_LOCAL:
+		/* return empty node mask for local allocation */
+		break;
+
 	case MPOL_PREFERRED:
-		if (!(p->flags & MPOL_F_LOCAL))
-			node_set(p->v.preferred_node, *nodes);
-		/* else return empty node mask for local allocation */
+		node_set(p->v.preferred_node, *nodes);
 		break;
 	default:
 		BUG();
@@ -1894,9 +1878,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 /* Return the node id preferred by the given mempolicy, or the given id */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
-	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
+	if (policy->mode == MPOL_PREFERRED) {
 		nd = policy->v.preferred_node;
-	else {
+	} else {
 		/*
 		 * __GFP_THISNODE shouldn't even be used with the bind policy
 		 * because we might easily break the expectation to stay on the
@@ -1933,14 +1917,11 @@ unsigned int mempolicy_slab_node(void)
 		return node;
 
 	policy = current->mempolicy;
-	if (!policy || policy->flags & MPOL_F_LOCAL)
+	if (!policy)
 		return node;
 
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
-		/*
-		 * handled MPOL_F_LOCAL above
-		 */
 		return policy->v.preferred_node;
 
 	case MPOL_INTERLEAVE:
@@ -1960,6 +1941,8 @@ unsigned int mempolicy_slab_node(void)
 							&policy->v.nodes);
 		return z->zone ? zone_to_nid(z->zone) : node;
 	}
+	case MPOL_LOCAL:
+		return node;
 
 	default:
 		BUG();
@@ -2072,16 +2055,18 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
-		if (mempolicy->flags & MPOL_F_LOCAL)
-			nid = numa_node_id();
-		else
-			nid = mempolicy->v.preferred_node;
+		nid = mempolicy->v.preferred_node;
 		init_nodemask_of_node(mask, nid);
 		break;
 
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		*mask =  mempolicy->v.nodes;
+		*mask = mempolicy->v.nodes;
+		break;
+
+	case MPOL_LOCAL:
+		nid = numa_node_id();
+		init_nodemask_of_node(mask, nid);
 		break;
 
 	default:
@@ -2188,7 +2173,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * If the policy is interleave, or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+		if (pol->mode == MPOL_PREFERRED)
 			hpage_node = pol->v.preferred_node;
 
 		nmask = policy_nodemask(gfp, pol);
@@ -2324,10 +2309,9 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 		return !!nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
-		/* a's ->flags is the same as b's */
-		if (a->flags & MPOL_F_LOCAL)
-			return true;
 		return a->v.preferred_node == b->v.preferred_node;
+	case MPOL_LOCAL:
+		return true;
 	default:
 		BUG();
 		return false;
@@ -2465,10 +2449,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
-		if (pol->flags & MPOL_F_LOCAL)
-			polnid = numa_node_id();
-		else
-			polnid = pol->v.preferred_node;
+		polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_LOCAL:
+		polnid = numa_node_id();
 		break;
 
 	case MPOL_BIND:
@@ -2835,9 +2820,6 @@ void numa_default_policy(void)
  * Parse and format mempolicy from/to strings
  */
 
-/*
- * "local" is implemented internally by MPOL_PREFERRED with MPOL_F_LOCAL flag.
- */
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -2915,7 +2897,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		 */
 		if (nodelist)
 			goto out;
-		mode = MPOL_PREFERRED;
 		break;
 	case MPOL_DEFAULT:
 		/*
@@ -2959,7 +2940,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	else if (nodelist)
 		new->v.preferred_node = first_node(nodes);
 	else
-		new->flags |= MPOL_F_LOCAL;
+		new->mode = MPOL_LOCAL;
 
 	/*
 	 * Save nodes for contextualization: this will be used to "clone"
@@ -3005,12 +2986,10 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 
 	switch (mode) {
 	case MPOL_DEFAULT:
+	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
-		if (flags & MPOL_F_LOCAL)
-			mode = MPOL_LOCAL;
-		else
-			node_set(pol->v.preferred_node, nodes);
+		node_set(pol->v.preferred_node, nodes);
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-- 
2.7.4



* [v3 PATCH 3/3] mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy
From: Feng Tang @ 2021-05-31 14:05 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

Currently kernel_mbind() and kernel_set_mempolicy() perform almost
the same parameter sanity checks.

Add a helper function to unify the code, reduce the redundancy, and
make it easier to change the pre-processing code in the future.

[thanks to David Rientjes for suggesting a helper function instead
of a macro]
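
A hedged userspace sketch of the semantics the helper enforces
(illustrative only, using the raw syscall and the UAPI mempolicy
constants): passing both node-flag bits in the mode argument is
rejected:

        #include <errno.h>
        #include <linux/mempolicy.h>    /* MPOL_* modes and mode flags */
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                unsigned long mask = 1UL << 0;  /* node 0 */

                /* MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES are exclusive */
                if (syscall(SYS_set_mempolicy, MPOL_BIND | MPOL_F_STATIC_NODES |
                            MPOL_F_RELATIVE_NODES, &mask, 8 * sizeof(mask)) < 0)
                        printf("rejected: errno %d (EINVAL is %d)\n", errno, EINVAL);
                return 0;
        }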

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 47 +++++++++++++++++++++++++++++------------------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c337bd7..85ef512 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1444,26 +1444,37 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
 	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
 }
 
+static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
+{
+	*flags = *mode & MPOL_MODE_FLAGS;
+	*mode &= ~MPOL_MODE_FLAGS;
+	if ((unsigned int)(*mode) >= MPOL_MAX)
+		return -EINVAL;
+	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
+		return -EINVAL;
+
+	return 0;
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
 {
+	unsigned short mode_flags;
 	nodemask_t nodes;
+	int lmode = mode;
 	int err;
-	unsigned short mode_flags;
 
 	start = untagged_addr(start);
-	mode_flags = mode & MPOL_MODE_FLAGS;
-	mode &= ~MPOL_MODE_FLAGS;
-	if (mode >= MPOL_MAX)
-		return -EINVAL;
-	if ((mode_flags & MPOL_F_STATIC_NODES) &&
-	    (mode_flags & MPOL_F_RELATIVE_NODES))
-		return -EINVAL;
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return err;
+
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-	return do_mbind(start, len, mode, mode_flags, &nodes, flags);
+
+	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
 }
 
 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
@@ -1477,20 +1488,20 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
 static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 				 unsigned long maxnode)
 {
-	int err;
+	unsigned short mode_flags;
 	nodemask_t nodes;
-	unsigned short flags;
+	int lmode = mode;
+	int err;
+
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return err;
 
-	flags = mode & MPOL_MODE_FLAGS;
-	mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)mode >= MPOL_MAX)
-		return -EINVAL;
-	if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES))
-		return -EINVAL;
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-	return do_set_mempolicy(mode, flags, &nodes);
+
+	return do_set_mempolicy(lmode, mode_flags, &nodes);
 }
 
 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
-- 
2.7.4



* Re: [v3 PATCH 0/3] mm/mempolicy: some fix and semantics cleanup
From: Andrew Morton @ 2021-05-31 21:41 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Michal Hocko, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Mon, 31 May 2021 22:05:53 +0800 Feng Tang <feng.tang@intel.com> wrote:

> We've posted a v4 patchset introducing a new "prefer-many" memory policy
> https://lore.kernel.org/lkml/1615952410-36895-1-git-send-email-feng.tang@intel.com/ ,
> for which Michal Hocko gave many comments and pointed out some
> problems. We also found some semantic confusion between the 'prefer'
> and 'local' policies, as well as some duplicated code. This patchset
> tries to address these issues. Please help to review, thanks!
> 
> The patchset has been run through sanity tests such as 'stress-ng'
> and 'ltp', and no problems were found.

None of the above is suitable for the [0/n] overall description.  I
copied-n-pasted the v1 cover letter from the above link.  Please check
that it is all still correct and up to date.  If not, please send along
replacement text, thanks.



* Re: [v3 PATCH 0/3] mm/mempolicy: some fix and semantics cleanup
From: Feng Tang @ 2021-06-01  0:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michal Hocko, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

Hi Andrew,

Thanks for reviewing and taking the patches.

On Mon, May 31, 2021 at 02:41:28PM -0700, Andrew Morton wrote:
> On Mon, 31 May 2021 22:05:53 +0800 Feng Tang <feng.tang@intel.com> wrote:
> 
> > We've posted a v4 patchset introducing a new "prefer-many" memory policy
> > https://lore.kernel.org/lkml/1615952410-36895-1-git-send-email-feng.tang@intel.com/ ,
> > for which Michal Hocko gave many comments and pointed out some
> > problems. We also found some semantic confusion between the 'prefer'
> > and 'local' policies, as well as some duplicated code. This patchset
> > tries to address these issues. Please help to review, thanks!
> > 
> > The patchset has been run through sanity tests such as 'stress-ng'
> > and 'ltp', and no problems were found.
> 
> None of the above is suitable for the [0/n] overall description.  I
> copied-n-pasted the v1 cover letter from the above link.  Please check
> that it is all still correct and up to date.  If not, please send along
> replacement text, thanks.

I should have made the cover letter more descriptive. The link above is
to another patchset, which introduces a new memory policy
MPOL_PREFERRED_MANY, while these 3 patches are preparation work for it,
making it easier for a new policy to be hooked in.

So how about the following text:

The current memory policy code has some confusing and ambiguous parts
regarding the MPOL_LOCAL policy, as it is handled as a fake
MPOL_PREFERRED one, and there are many places that have to distinguish
them. Also, the nodemask intersection check needs a cleanup to be more
explicit for the OOM use and to handle MPOL_INTERLEAVE correctly. This
patchset cleans these up and unifies the parameter sanity check for
mbind() and set_mempolicy().

Please feel free to modify it, thanks!

- Feng


* Re: [v3 PATCH 1/3] mm/mempolicy: cleanup nodemask intersection check for oom
From: Michal Hocko @ 2021-06-01  8:19 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Mon 31-05-21 22:05:54, Feng Tang wrote:
> mempolicy_nodemask_intersects() is used in the OOM case to check
> whether a task may have memory allocated on some memory nodes.
> 
> As it is only used by the OOM check, rename it to
> mempolicy_in_oom_domain() to reduce confusion.
> 
> Since only the 'bind' policy makes the nodemask a hard requirement
> for where memory may be allocated from, do the intersection check
> only for that policy, and return true for all other policies.

I would slightly rephrase the above to
"
mempolicy_nodemask_intersects seems to be a general purpose mempolicy
function. In fact it is partially tailored for the OOM purpose instead.
The oom proper is the only existing user so rename the function to make
that purpose explicit.

While at it drop the MPOL_INTERLEAVE as those allocations never have a
nodemask defined (see alloc_page_interleave) so this is dead code
and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard
requirement so it shouldn't be considered during the OOM.

The final code can be reduced to a check for MPOL_BIND which is the only
memory policy that is a hard requirement and thus relevant to a
constrained OOM logic.
"

> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>

To the change itself
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/mempolicy.h |  2 +-
>  mm/mempolicy.c            | 34 +++++++++-------------------------
>  mm/oom_kill.c             |  2 +-
>  3 files changed, 11 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 5f1c74d..8773c55 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -150,7 +150,7 @@ extern int huge_node(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
>  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> -extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +extern bool mempolicy_in_oom_domain(struct task_struct *tsk,
>  				const nodemask_t *mask);
>  extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d79fa29..6795a6a 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2094,16 +2094,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  #endif
>  
>  /*
> - * mempolicy_nodemask_intersects
> + * mempolicy_in_oom_domain
>   *
> - * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
> - * policy.  Otherwise, check for intersection between mask and the policy
> - * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
> - * policy, always return true since it may allocate elsewhere on fallback.
> + * If tsk's mempolicy is "bind", check for intersection between mask and
> + * the policy nodemask. Otherwise, return true for all other policies
> + * including "interleave", as a tsk with "interleave" policy may have
> + * memory allocated from all nodes in system.
>   *
>   * Takes task_lock(tsk) to prevent freeing of its mempolicy.
>   */
> -bool mempolicy_nodemask_intersects(struct task_struct *tsk,
> +bool mempolicy_in_oom_domain(struct task_struct *tsk,
>  					const nodemask_t *mask)
>  {
>  	struct mempolicy *mempolicy;
> @@ -2111,29 +2111,13 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
>  
>  	if (!mask)
>  		return ret;
> +
>  	task_lock(tsk);
>  	mempolicy = tsk->mempolicy;
> -	if (!mempolicy)
> -		goto out;
> -
> -	switch (mempolicy->mode) {
> -	case MPOL_PREFERRED:
> -		/*
> -		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
> -		 * allocate from, they may fallback to other nodes when oom.
> -		 * Thus, it's possible for tsk to have allocated memory from
> -		 * nodes in mask.
> -		 */
> -		break;
> -	case MPOL_BIND:
> -	case MPOL_INTERLEAVE:
> +	if (mempolicy && mempolicy->mode == MPOL_BIND)
>  		ret = nodes_intersects(mempolicy->v.nodes, *mask);
> -		break;
> -	default:
> -		BUG();
> -	}
> -out:
>  	task_unlock(tsk);
> +
>  	return ret;
>  }
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5..fcc29e9 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -104,7 +104,7 @@ static bool oom_cpuset_eligible(struct task_struct *start,
>  			 * mempolicy intersects current, otherwise it may be
>  			 * needlessly killed.
>  			 */
> -			ret = mempolicy_nodemask_intersects(tsk, mask);
> +			ret = mempolicy_in_oom_domain(tsk, mask);
>  		} else {
>  			/*
>  			 * This is not a mempolicy constrained oom, so only
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [v3 PATCH 2/3] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
From: Michal Hocko @ 2021-06-01  8:44 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Mon 31-05-21 22:05:55, Feng Tang wrote:
> The MPOL_LOCAL policy has been set up as a real policy, but it is
> still handled like a fake MPOL_PREFERRED policy with the internal
> MPOL_F_LOCAL flag bit set, and many places have to distinguish the
> real 'prefer' policy from the 'local' one, which is quite confusing.
> 
> In the current code, MPOL_LOCAL is used in 4 cases:
> 1. the user specifies the 'local' policy
> 2. the user specifies the 'prefer' policy, but with an empty nodemask
> 3. the system 'default' policy is used
> 4. the 'prefer' policy with a valid 'preferred' node and the
>    MPOL_F_STATIC_NODES flag set: when it is rebound to a nodemask
>    which doesn't contain the 'preferred' node, it behaves as the
>    'local' policy
> 
> So make 'local' a real policy instead of a fake 'prefer' one, and
> kill the MPOL_F_LOCAL bit, which greatly reduces the confusion when
> reading the code.
> 
> For case 4, the logic of mpol_rebind_preferred() is confusing, as
> Michal Hocko pointed out:
> 
>  "
>  I do believe that rebinding preferred policy is just bogus and
>  it should be dropped altogether on the ground that a preference
>  is a mere hint from userspace where to start the allocation.
>  Unless I am missing something cpusets will be always authoritative
>  for the final placement. The preferred node just acts as a starting
>  point and it should be really preserved when cpusets changes.
>  Otherwise we have very subtle behavior corner cases.
>  "
> So drop all the tricky transformations between 'prefer' and 'local',
> and just record the new nodemask from the rebinding.
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>

I like this very much! It simplifies some tricky code and also a very
dubious behavior. I would like to hear from others whether there might
be some userspace depending on this obscure behavior though. One never
knows...

Some more notes/questions below

[...]
> @@ -239,25 +240,19 @@ static int mpol_set_nodemask(struct mempolicy *pol,
>  		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
>  
>  	VM_BUG_ON(!nodes);
> -	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
> -		nodes = NULL;	/* explicit local allocation */
> -	else {
> -		if (pol->flags & MPOL_F_RELATIVE_NODES)
> -			mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
> -		else
> -			nodes_and(nsc->mask2, *nodes, nsc->mask1);
>  
> -		if (mpol_store_user_nodemask(pol))
> -			pol->w.user_nodemask = *nodes;
> -		else
> -			pol->w.cpuset_mems_allowed =
> -						cpuset_current_mems_allowed;
> -	}
> +	if (pol->flags & MPOL_F_RELATIVE_NODES)
> +		mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
> +	else
> +		nodes_and(nsc->mask2, *nodes, nsc->mask1);

Maybe I've just got lost here, but why don't you need to check for the
local policy anymore? mpol_new will take care of the MPOL_PREFERRED &&
nodes_empty special case, but why do we want/need all this for a local
policy at all?

>  
> -	if (nodes)
> -		ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
> +	if (mpol_store_user_nodemask(pol))
> +		pol->w.user_nodemask = *nodes;
>  	else
> -		ret = mpol_ops[pol->mode].create(pol, NULL);
> +		pol->w.cpuset_mems_allowed =
> +					cpuset_current_mems_allowed;

please use a single line. This is just harder to read. You will cross
the line limit but readability should be preferred here.

[...]

I haven't spotted anything else.
-- 
Michal Hocko
SUSE Labs


* Re: [v3 PATCH 3/3] mm/mempolicy: unify the parameter sanity check for mbind and set_mempolicy
From: Michal Hocko @ 2021-06-01  8:46 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Mon 31-05-21 22:05:56, Feng Tang wrote:
> Currently kernel_mbind() and kernel_set_mempolicy() perform almost
> the same parameter sanity checks.
> 
> Add a helper function to unify the code, reduce the redundancy, and
> make it easier to change the pre-processing code in the future.
> 
> [thanks to David Rientjes for suggesting a helper function instead
> of a macro]
> 
> Signed-off-by: Feng Tang <feng.tang@intel.com>

sanitize_mpol_flags() would benefit from some high-level comments
explaining those modifications, but that can be done on top. This looks
like a useful cleanup on its own.
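
For instance, something along these lines (a sketch of the kind of
comment being suggested, not actual patch content):

        /*
         * sanitize_mpol_flags() - split the raw user-supplied mode value
         * into a mode and optional mode flags, rejecting out-of-range
         * modes and the mutually exclusive static/relative node flags.
         */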

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/mempolicy.c | 47 +++++++++++++++++++++++++++++------------------
>  1 file changed, 29 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index c337bd7..85ef512 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1444,26 +1444,37 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
>  	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
>  }
>  
> +static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
> +{
> +	*flags = *mode & MPOL_MODE_FLAGS;
> +	*mode &= ~MPOL_MODE_FLAGS;
> +	if ((unsigned int)(*mode) >= MPOL_MAX)
> +		return -EINVAL;
> +	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
>  static long kernel_mbind(unsigned long start, unsigned long len,
>  			 unsigned long mode, const unsigned long __user *nmask,
>  			 unsigned long maxnode, unsigned int flags)
>  {
> +	unsigned short mode_flags;
>  	nodemask_t nodes;
> +	int lmode = mode;
>  	int err;
> -	unsigned short mode_flags;
>  
>  	start = untagged_addr(start);
> -	mode_flags = mode & MPOL_MODE_FLAGS;
> -	mode &= ~MPOL_MODE_FLAGS;
> -	if (mode >= MPOL_MAX)
> -		return -EINVAL;
> -	if ((mode_flags & MPOL_F_STATIC_NODES) &&
> -	    (mode_flags & MPOL_F_RELATIVE_NODES))
> -		return -EINVAL;
> +	err = sanitize_mpol_flags(&lmode, &mode_flags);
> +	if (err)
> +		return err;
> +
>  	err = get_nodes(&nodes, nmask, maxnode);
>  	if (err)
>  		return err;
> -	return do_mbind(start, len, mode, mode_flags, &nodes, flags);
> +
> +	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
>  }
>  
>  SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
> @@ -1477,20 +1488,20 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
>  static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
>  				 unsigned long maxnode)
>  {
> -	int err;
> +	unsigned short mode_flags;
>  	nodemask_t nodes;
> -	unsigned short flags;
> +	int lmode = mode;
> +	int err;
> +
> +	err = sanitize_mpol_flags(&lmode, &mode_flags);
> +	if (err)
> +		return err;
>  
> -	flags = mode & MPOL_MODE_FLAGS;
> -	mode &= ~MPOL_MODE_FLAGS;
> -	if ((unsigned int)mode >= MPOL_MAX)
> -		return -EINVAL;
> -	if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES))
> -		return -EINVAL;
>  	err = get_nodes(&nodes, nmask, maxnode);
>  	if (err)
>  		return err;
> -	return do_set_mempolicy(mode, flags, &nodes);
> +
> +	return do_set_mempolicy(lmode, mode_flags, &nodes);
>  }
>  
>  SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [v3 PATCH 0/3] mm/mempolicy: some fix and semantics cleanup
From: Michal Hocko @ 2021-06-01  8:48 UTC (permalink / raw)
  To: Feng Tang
  Cc: Andrew Morton, linux-mm, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Tue 01-06-21 08:55:13, Feng Tang wrote:
[...]
> The current memory policy code has some confusing and ambiguous parts
> regarding the MPOL_LOCAL policy, as it is handled as a fake
> MPOL_PREFERRED one, and there are many places that have to distinguish
> them. Also, the nodemask intersection check needs a cleanup to be more
> explicit for the OOM use and to handle MPOL_INTERLEAVE correctly. This
> patchset cleans these up and unifies the parameter sanity check for
> mbind() and set_mempolicy().

Looks good to me. I would just add that this cleanup helps to make
further changes easier (notably MPOL_PREFERRED_MANY)

-- 
Michal Hocko
SUSE Labs


* Re: [v3 PATCH 1/3] mm/mempolicy: cleanup nodemask intersection check for oom
From: Feng Tang @ 2021-06-01 11:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Tue, Jun 01, 2021 at 10:19:25AM +0200, Michal Hocko wrote:
> On Mon 31-05-21 22:05:54, Feng Tang wrote:
> > mempolicy_nodemask_intersects() is used in the OOM case to check
> > whether a task may have memory allocated on some memory nodes.
> > 
> > As it is only used by the OOM check, rename it to
> > mempolicy_in_oom_domain() to reduce confusion.
> > 
> > Since only the 'bind' policy makes the nodemask a hard requirement
> > for where memory may be allocated from, do the intersection check
> > only for that policy, and return true for all other policies.
> 
> I would slightly rephrase the above to
> "
> mempolicy_nodemask_intersects seems to be a general purpose mempolicy
> function. In fact it is partially tailored for the OOM purpose instead.
> The oom proper is the only existing user so rename the function to make
> that purpose explicit.
> 
> While at it drop the MPOL_INTERLEAVE as those allocations never have a
> nodemask defined (see alloc_page_interleave) so this is dead code
> and a confusing one because MPOL_INTERLEAVE is a hint rather than a hard
> requirement so it shouldn't be considered during the OOM.
> 
> The final code can be reduced to a check for MPOL_BIND which is the only
> memory policy that is a hard requirement and thus relevant to a
> constrained OOM logic.
> "

This is much clearer, thanks!

Will change this and the description in the cover letter.

> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> To the change itself
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

- Feng


* Re: [v3 PATCH 2/3] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
From: Feng Tang @ 2021-06-01 11:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Tue, Jun 01, 2021 at 10:44:39AM +0200, Michal Hocko wrote:
> On Mon 31-05-21 22:05:55, Feng Tang wrote:
> > The MPOL_LOCAL policy has been set up as a real policy, but it is
> > still handled like a fake MPOL_PREFERRED policy with the internal
> > MPOL_F_LOCAL flag bit set, and many places have to distinguish the
> > real 'prefer' policy from the 'local' one, which is quite confusing.
> > 
> > In the current code, MPOL_LOCAL is used in 4 cases:
> > 1. the user specifies the 'local' policy
> > 2. the user specifies the 'prefer' policy, but with an empty nodemask
> > 3. the system 'default' policy is used
> > 4. the 'prefer' policy with a valid 'preferred' node and the
> >    MPOL_F_STATIC_NODES flag set: when it is rebound to a nodemask
> >    which doesn't contain the 'preferred' node, it behaves as the
> >    'local' policy
> > 
> > So make 'local' a real policy instead of a fake 'prefer' one, and
> > kill the MPOL_F_LOCAL bit, which greatly reduces the confusion when
> > reading the code.
> > 
> > For case 4, the logic of mpol_rebind_preferred() is confusing, as
> > Michal Hocko pointed out:
> > 
> >  "
> >  I do believe that rebinding preferred policy is just bogus and
> >  it should be dropped altogether on the ground that a preference
> >  is a mere hint from userspace where to start the allocation.
> >  Unless I am missing something cpusets will be always authoritative
> >  for the final placement. The preferred node just acts as a starting
> >  point and it should be really preserved when cpusets changes.
> >  Otherwise we have very subtle behavior corner cases.
> >  "
> > So drop all the tricky transformations between 'prefer' and 'local',
> > and just record the new nodemask from the rebinding.
> > 
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> I like this very much! It simplifies some tricky code and also a very
> dubious behavior. I would like to hear from others whether there might
> be some userspace depending on this obscure behavior though. One never
> knows...
> 
> Some more notes/questions below
> 
> [...]
> > @@ -239,25 +240,19 @@ static int mpol_set_nodemask(struct mempolicy *pol,
> >  		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
> >  
> >  	VM_BUG_ON(!nodes);
> > -	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
> > -		nodes = NULL;	/* explicit local allocation */
> > -	else {
> > -		if (pol->flags & MPOL_F_RELATIVE_NODES)
> > -			mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
> > -		else
> > -			nodes_and(nsc->mask2, *nodes, nsc->mask1);
> >  
> > -		if (mpol_store_user_nodemask(pol))
> > -			pol->w.user_nodemask = *nodes;
> > -		else
> > -			pol->w.cpuset_mems_allowed =
> > -						cpuset_current_mems_allowed;
> > -	}
> > +	if (pol->flags & MPOL_F_RELATIVE_NODES)
> > +		mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
> > +	else
> > +		nodes_and(nsc->mask2, *nodes, nsc->mask1);
> 
> Maybe I've just got lost here, but why don't you need to check for the
> local policy anymore? mpol_new will take care of the MPOL_PREFERRED &&
> nodes_empty special case, but why do we want/need all this for a local
> policy at all?
 
You are right that the 'local' policy doesn't need this; it should
just return in the early part of this function, like the 'default'
policy, which also lets us remove the useless no-op mpol_new_local().
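
Something like this at the top of mpol_set_nodemask() (a sketch of
the direction described above, not the final patch):

        /* 'default' and 'local' policies need no nodemask setup */
        if (!pol || pol->mode == MPOL_LOCAL)
                return 0;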

> >  
> > -	if (nodes)
> > -		ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
> > +	if (mpol_store_user_nodemask(pol))
> > +		pol->w.user_nodemask = *nodes;
> >  	else
> > -		ret = mpol_ops[pol->mode].create(pol, NULL);
> > +		pol->w.cpuset_mems_allowed =
> > +					cpuset_current_mems_allowed;
> 
> please use a single line. This is just harder to read. You will cross
> the line limit but readability should be preferred here.
 
Will change.

Thanks,
Feng

> [...]
> 
> I haven't spotted anything else.
> -- 
> Michal Hocko
> SUSE Labs


* Re: [v3 PATCH 1/3] mm/mempolicy: cleanup nodemask intersection check for oom
From: Andrew Morton @ 2021-06-01 23:56 UTC (permalink / raw)
  To: Feng Tang
  Cc: Michal Hocko, linux-mm, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Tue, 1 Jun 2021 19:08:40 +0800 Feng Tang <feng.tang@intel.com> wrote:

> This is much clearer, thanks!
> 
> Will change this and the description in the cover letter.

I pasted Michal's changelog into -mm's
mm-mempolicy-cleanup-nodemask-intersection-check-for-oom.patch.

