* [PATCH v1 0/4] mm/mempolicy: some fix and semantics cleanup
@ 2021-05-26  5:01 Feng Tang
  2021-05-26  5:01 ` [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom Feng Tang
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Feng Tang @ 2021-05-26  5:01 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

Hi All,

We posted a v4 patchset introducing a new "prefer-many" memory policy
https://lore.kernel.org/lkml/1615952410-36895-1-git-send-email-feng.tang@intel.com/ ,
for which Michal Hocko gave many comments and pointed out some
problems. We also found some semantic confusion around the 'prefer'
and 'local' policies, as well as some duplicated code. This patchset
tries to address these issues. Please help to review, thanks!

The patchset has been run through some sanity tests such as 'stress-ng'
and 'ltp', and no problems were found.

- Feng

Changelogs:

    v1:
      * use a helper function instead of a macro for patch 2/4 (David Rientjes)
      * fix a possible NULL pointer case in patch 3/4
      * update the commit log for 1/4

    RFC v2:
      * add the oom check fix patch 1/4
      * add the unification patch for mpol preprocessing 2/4

Feng Tang (4):
  mm/mempolicy: skip nodemask intersect check for 'interleave' when oom
  mm/mempolicy: unify the preprocessing for mbind and set_mempolicy
  mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED
    policy
  mm/mempolicy: kill MPOL_F_LOCAL bit

 include/uapi/linux/mempolicy.h |   1 +
 mm/mempolicy.c                 | 208 +++++++++++++++++++++--------------------
 2 files changed, 109 insertions(+), 100 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom
  2021-05-26  5:01 [PATCH v1 0/4] mm/mempolicy: some fix and semantics cleanup Feng Tang
@ 2021-05-26  5:01 ` Feng Tang
  2021-05-27  7:30   ` Michal Hocko
  2021-05-26  5:01 ` [PATCH v1 2/4] mm/mempolicy: unify the preprocessing for mbind and set_mempolicy Feng Tang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-26  5:01 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

mempolicy_nodemask_intersects() is used in the OOM case to check
whether a task may have memory allocated on some memory nodes.

Currently, nodes_intersects() is run for both the 'bind' and
'interleave' policies. But the two differ regarding memory allocation:
the nodemask is a hard requirement for 'bind', while it is only a hint
for 'interleave'. For example, in alloc_pages_vma():

	nmask = policy_nodemask(gfp, pol);
        preferred_nid = policy_node(gfp, pol, node);
        page = __alloc_pages(gfp, order, preferred_nid, nmask);

In policy_nodemask(), only the 'bind' policy may return its desired
nodemask, while the others return NULL. This NULL allows the
'interleave' policy to get memory from nodes outside its nodemask.

So skip the nodemask intersect check for the 'interleave' policy.
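The effect of the change can be sketched in plain userspace C. This is a mock only: the enum, nodes_intersects() and the helper name are simplified stand-ins for the kernel's types, not the actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the mempolicy modes; values mirror the kernel's ordering. */
enum mpol_mode { MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, MPOL_LOCAL };

/* Nodemasks reduced to a plain bitmap of node ids for illustration. */
static bool nodes_intersects(unsigned long a, unsigned long b)
{
	return (a & b) != 0;
}

/*
 * After the patch: only 'bind' restricts where memory may come from,
 * so only 'bind' consults the intersection; every other policy can
 * fall back to any node and is therefore considered to intersect.
 */
static bool oom_nodemask_intersects(enum mpol_mode mode,
				    unsigned long policy_nodes,
				    unsigned long oom_mask)
{
	if (mode == MPOL_BIND)
		return nodes_intersects(policy_nodes, oom_mask);
	return true;
}
```

With this shape, an 'interleave' task is always eligible in the OOM domain, while a 'bind' task is only eligible when its nodemask overlaps the OOM constraint.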

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 24 ++++--------------------
 1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d79fa29..1964cca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2098,7 +2098,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
  *
  * If tsk's mempolicy is "default" [NULL], return 'true' to indicate default
  * policy.  Otherwise, check for intersection between mask and the policy
- * nodemask for 'bind' or 'interleave' policy.  For 'preferred' or 'local'
+ * nodemask for 'bind' policy.  For 'interleave', 'preferred' or 'local'
  * policy, always return true since it may allocate elsewhere on fallback.
  *
  * Takes task_lock(tsk) to prevent freeing of its mempolicy.
@@ -2111,29 +2111,13 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 
 	if (!mask)
 		return ret;
+
 	task_lock(tsk);
 	mempolicy = tsk->mempolicy;
-	if (!mempolicy)
-		goto out;
-
-	switch (mempolicy->mode) {
-	case MPOL_PREFERRED:
-		/*
-		 * MPOL_PREFERRED and MPOL_F_LOCAL are only preferred nodes to
-		 * allocate from, they may fallback to other nodes when oom.
-		 * Thus, it's possible for tsk to have allocated memory from
-		 * nodes in mask.
-		 */
-		break;
-	case MPOL_BIND:
-	case MPOL_INTERLEAVE:
+	if (mempolicy && mempolicy->mode == MPOL_BIND)
 		ret = nodes_intersects(mempolicy->v.nodes, *mask);
-		break;
-	default:
-		BUG();
-	}
-out:
 	task_unlock(tsk);
+
 	return ret;
 }
 
-- 
2.7.4



* [PATCH v1 2/4] mm/mempolicy: unify the preprocessing for mbind and set_mempolicy
  2021-05-26  5:01 [PATCH v1 0/4] mm/mempolicy: some fix and semantics cleanup Feng Tang
  2021-05-26  5:01 ` [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom Feng Tang
@ 2021-05-26  5:01 ` Feng Tang
  2021-05-27  7:39   ` Michal Hocko
  2021-05-26  5:01 ` [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy Feng Tang
  2021-05-26  5:01 ` [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit Feng Tang
  3 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-26  5:01 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

Currently kernel_mbind() and kernel_set_mempolicy() perform almost
the same parameter sanity checks and preprocessing.

Add a helper function to unify the code, reducing the redundancy
and making it easier to change the preprocessing code in the future.

[thanks to David Rientjes for suggesting a helper function
instead of a macro]
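A rough userspace sketch of the shared sanity check follows. The constants are copied from include/uapi/linux/mempolicy.h; the helper and wrapper names are illustrative, not the kernel's final code.

```c
#include <assert.h>
#include <errno.h>

/* Constants as in include/uapi/linux/mempolicy.h at this point in time. */
#define MPOL_BIND		2
#define MPOL_MAX		5	/* MPOL_DEFAULT .. MPOL_LOCAL */
#define MPOL_F_RELATIVE_NODES	(1 << 14)
#define MPOL_F_STATIC_NODES	(1 << 15)
#define MPOL_MODE_FLAGS		(MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)

/* Split the user-supplied mode word into mode + flags and validate both. */
static int split_mode_and_flags(int *mode, unsigned short *flags)
{
	*flags = *mode & MPOL_MODE_FLAGS;
	*mode &= ~MPOL_MODE_FLAGS;
	if ((unsigned int)*mode >= MPOL_MAX)
		return -EINVAL;
	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
		return -EINVAL;
	return 0;
}

/* Convenience wrappers so the behaviour is easy to spot-check. */
static int split_err(int raw)
{
	unsigned short flags;

	return split_mode_and_flags(&raw, &flags);
}

static int split_mode(int raw)
{
	unsigned short flags;

	(void)split_mode_and_flags(&raw, &flags);
	return raw;
}
```

Both syscall entry points would then call one such helper before handing the cleaned-up mode and flags to do_mbind()/do_set_mempolicy().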

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1964cca..2830bb8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1460,6 +1460,20 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
 	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
 }
 
+static inline int mpol_pre_process(int *mode, const unsigned long __user *nmask, unsigned long maxnode, nodemask_t *nodes, unsigned short *flags)
+{
+	int ret;
+
+	*flags = *mode & MPOL_MODE_FLAGS;
+	*mode &= ~MPOL_MODE_FLAGS;
+	if ((unsigned int)(*mode) >= MPOL_MAX)
+		return -EINVAL;
+	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
+		return -EINVAL;
+	ret = get_nodes(nodes, nmask, maxnode);
+	return ret;
+}
+
 static long kernel_mbind(unsigned long start, unsigned long len,
 			 unsigned long mode, const unsigned long __user *nmask,
 			 unsigned long maxnode, unsigned int flags)
@@ -1467,19 +1481,14 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	nodemask_t nodes;
 	int err;
 	unsigned short mode_flags;
+	int lmode = mode;
 
-	start = untagged_addr(start);
-	mode_flags = mode & MPOL_MODE_FLAGS;
-	mode &= ~MPOL_MODE_FLAGS;
-	if (mode >= MPOL_MAX)
-		return -EINVAL;
-	if ((mode_flags & MPOL_F_STATIC_NODES) &&
-	    (mode_flags & MPOL_F_RELATIVE_NODES))
-		return -EINVAL;
-	err = get_nodes(&nodes, nmask, maxnode);
+	err = mpol_pre_process(&lmode, nmask, maxnode, &nodes, &mode_flags);
 	if (err)
 		return err;
-	return do_mbind(start, len, mode, mode_flags, &nodes, flags);
+
+	start = untagged_addr(start);
+	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
 }
 
 SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
@@ -1495,18 +1504,14 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 {
 	int err;
 	nodemask_t nodes;
-	unsigned short flags;
+	unsigned short mode_flags;
+	int lmode = mode;
 
-	flags = mode & MPOL_MODE_FLAGS;
-	mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)mode >= MPOL_MAX)
-		return -EINVAL;
-	if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES))
-		return -EINVAL;
-	err = get_nodes(&nodes, nmask, maxnode);
+	err = mpol_pre_process(&lmode, nmask, maxnode, &nodes, &mode_flags);
 	if (err)
 		return err;
-	return do_set_mempolicy(mode, flags, &nodes);
+
+	return do_set_mempolicy(lmode, mode_flags, &nodes);
 }
 
 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
-- 
2.7.4



* [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
  2021-05-26  5:01 [PATCH v1 0/4] mm/mempolicy: some fix and semantics cleanup Feng Tang
  2021-05-26  5:01 ` [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom Feng Tang
  2021-05-26  5:01 ` [PATCH v1 2/4] mm/mempolicy: unify the preprocessing for mbind and set_mempolicy Feng Tang
@ 2021-05-26  5:01 ` Feng Tang
  2021-05-27  8:12   ` Michal Hocko
  2021-05-26  5:01 ` [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit Feng Tang
  3 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-26  5:01 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

MPOL_LOCAL has been defined as a real policy, but it is still
handled like a fake MPOL_PREFERRED policy with the internal
MPOL_F_LOCAL flag bit set, and many places have to distinguish
the real 'prefer' policy from the 'local' one, which is quite
confusing.

In the current code, there are four cases where MPOL_LOCAL is used:
* the user specifies the 'local' policy
* the user specifies the 'prefer' policy with an empty nodemask
* the system 'default' policy is used
* a 'prefer' policy with a valid preferred node and the
  MPOL_F_STATIC_NODES flag set is rebound to a nodemask which doesn't
  contain the preferred node; the MPOL_F_LOCAL bit is then added and
  the policy behaves as 'local'. If it is later rebound again with a
  valid nodemask, the policy is restored back to 'prefer'.

So for the first three cases, make 'local' a real policy instead of
a fake 'prefer' one; this reduces confusion when reading the code.

The next, optional, patch will kill the MPOL_F_LOCAL bit.
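The resulting simplification can be illustrated with a small userspace mock (a sketch, not kernel code; names mirror the kernel's but the types are simplified): with 'local' as a first-class mode, plain per-mode dispatch replaces checks of the internal MPOL_F_LOCAL flag.

```c
#include <assert.h>

enum mpol_mode { MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, MPOL_LOCAL };

/* A cut-down stand-in for struct mempolicy, for illustration only. */
struct mempolicy_mock {
	enum mpol_mode mode;
	int preferred_node;
};

/*
 * Which node the policy prefers, given the caller's local node.
 * 'prefer' always carries a real node now; 'local' is its own case,
 * so no MPOL_F_LOCAL flag test is needed anywhere.
 */
static int policy_preferred_node(const struct mempolicy_mock *pol, int local_node)
{
	switch (pol->mode) {
	case MPOL_PREFERRED:
		return pol->preferred_node;
	case MPOL_LOCAL:
	default:
		return local_node;
	}
}
```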

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 66 ++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 27 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2830bb8..d97839d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -121,8 +121,7 @@ enum zone_type policy_zone = 0;
  */
 static struct mempolicy default_policy = {
 	.refcnt = ATOMIC_INIT(1), /* never free it */
-	.mode = MPOL_PREFERRED,
-	.flags = MPOL_F_LOCAL,
+	.mode = MPOL_LOCAL,
 };
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
@@ -200,12 +199,9 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
 
 static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 {
-	if (!nodes)
-		pol->flags |= MPOL_F_LOCAL;	/* local allocation */
-	else if (nodes_empty(*nodes))
-		return -EINVAL;			/*  no allowed nodes */
-	else
-		pol->v.preferred_node = first_node(*nodes);
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->v.preferred_node = first_node(*nodes);
 	return 0;
 }
 
@@ -217,6 +213,11 @@ static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_local(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	return 0;
+}
+
 /*
  * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
  * any, for the new policy.  mpol_new() has already validated the nodes
@@ -239,25 +240,19 @@ static int mpol_set_nodemask(struct mempolicy *pol,
 		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
 
 	VM_BUG_ON(!nodes);
-	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
-		nodes = NULL;	/* explicit local allocation */
-	else {
-		if (pol->flags & MPOL_F_RELATIVE_NODES)
-			mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
-		else
-			nodes_and(nsc->mask2, *nodes, nsc->mask1);
 
-		if (mpol_store_user_nodemask(pol))
-			pol->w.user_nodemask = *nodes;
-		else
-			pol->w.cpuset_mems_allowed =
-						cpuset_current_mems_allowed;
-	}
+	if (pol->flags & MPOL_F_RELATIVE_NODES)
+		mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
+	else
+		nodes_and(nsc->mask2, *nodes, nsc->mask1);
 
-	if (nodes)
-		ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
+	if (mpol_store_user_nodemask(pol))
+		pol->w.user_nodemask = *nodes;
 	else
-		ret = mpol_ops[pol->mode].create(pol, NULL);
+		pol->w.cpuset_mems_allowed =
+					cpuset_current_mems_allowed;
+
+	ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
 	return ret;
 }
 
@@ -290,13 +285,14 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			if (((flags & MPOL_F_STATIC_NODES) ||
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
+
+			mode = MPOL_LOCAL;
 		}
 	} else if (mode == MPOL_LOCAL) {
 		if (!nodes_empty(*nodes) ||
 		    (flags & MPOL_F_STATIC_NODES) ||
 		    (flags & MPOL_F_RELATIVE_NODES))
 			return ERR_PTR(-EINVAL);
-		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -427,6 +423,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_bind,
 		.rebind = mpol_rebind_nodemask,
 	},
+	[MPOL_LOCAL] = {
+		.create = mpol_new_local,
+		.rebind = mpol_rebind_default,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -1965,6 +1965,8 @@ unsigned int mempolicy_slab_node(void)
 							&policy->v.nodes);
 		return z->zone ? zone_to_nid(z->zone) : node;
 	}
+	case MPOL_LOCAL:
+		return node;
 
 	default:
 		BUG();
@@ -2089,6 +2091,11 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 		*mask =  mempolicy->v.nodes;
 		break;
 
+	case MPOL_LOCAL:
+		nid = numa_node_id();
+		init_nodemask_of_node(mask, nid);
+		break;
+
 	default:
 		BUG();
 	}
@@ -2333,6 +2340,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 		if (a->flags & MPOL_F_LOCAL)
 			return true;
 		return a->v.preferred_node == b->v.preferred_node;
+	case MPOL_LOCAL:
+		return true;
 	default:
 		BUG();
 		return false;
@@ -2476,6 +2485,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 			polnid = pol->v.preferred_node;
 		break;
 
+	case MPOL_LOCAL:
+		polnid = numa_node_id();
+		break;
+
 	case MPOL_BIND:
 		/* Optimize placement among multiple nodes via NUMA balancing */
 		if (pol->flags & MPOL_F_MORON) {
@@ -2920,7 +2933,6 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		 */
 		if (nodelist)
 			goto out;
-		mode = MPOL_PREFERRED;
 		break;
 	case MPOL_DEFAULT:
 		/*
@@ -2964,7 +2976,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	else if (nodelist)
 		new->v.preferred_node = first_node(nodes);
 	else
-		new->flags |= MPOL_F_LOCAL;
+		new->mode = MPOL_LOCAL;
 
 	/*
 	 * Save nodes for contextualization: this will be used to "clone"
-- 
2.7.4



* [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-26  5:01 [PATCH v1 0/4] mm/mempolicy: some fix and semantics cleanup Feng Tang
                   ` (2 preceding siblings ...)
  2021-05-26  5:01 ` [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy Feng Tang
@ 2021-05-26  5:01 ` Feng Tang
  2021-05-27  8:20   ` Michal Hocko
  3 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-26  5:01 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, Andrea Arcangeli, Mel Gorman, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang, Feng Tang

Now the only remaining case of a real 'local' policy faked by a
'prefer' policy plus the MPOL_F_LOCAL bit is:

A valid 'prefer' policy with a valid preferred node is rebound to a
nodemask which doesn't contain that node; the policy then handles
allocation as 'local'.

Add a new MPOL_F_LOCAL_TEMP bit for this case, and kill the
MPOL_F_LOCAL bit, which simplifies the code considerably.
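The prefer <-> local flip described above can be modeled as a tiny state machine. This is a userspace sketch under simplified assumptions (bitmask nodemasks, mock struct and names), not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MPOL_F_LOCAL_TEMP (1 << 5)	/* temporarily flipped from 'prefer' */

enum mock_mode { MOCK_PREFERRED, MOCK_LOCAL };

struct mock_pol {
	enum mock_mode mode;
	unsigned int flags;
	int user_node;		/* node the user originally asked for */
	int preferred_node;
};

static bool node_allowed(int node, unsigned long mask)
{
	return (mask >> node) & 1UL;
}

/*
 * Rebind against a new allowed-nodes mask: a static-nodes 'prefer'
 * policy whose node disappears becomes 'local' with MPOL_F_LOCAL_TEMP
 * set; a later rebind that makes the user's node available again
 * restores it to 'prefer' and clears the bit.
 */
static void mock_rebind(struct mock_pol *p, unsigned long new_mask)
{
	if (p->mode == MOCK_PREFERRED && !node_allowed(p->user_node, new_mask)) {
		p->mode = MOCK_LOCAL;
		p->flags |= MPOL_F_LOCAL_TEMP;
	} else if (p->mode == MOCK_LOCAL && (p->flags & MPOL_F_LOCAL_TEMP) &&
		   node_allowed(p->user_node, new_mask)) {
		p->mode = MOCK_PREFERRED;
		p->preferred_node = p->user_node;
		p->flags &= ~MPOL_F_LOCAL_TEMP;
	}
}

/* Run two rebinds on a 'prefer node 1' policy and report the final mode. */
static int mode_after(unsigned long mask1, unsigned long mask2)
{
	struct mock_pol p = { MOCK_PREFERRED, 0, 1, 1 };

	mock_rebind(&p, mask1);
	mock_rebind(&p, mask2);
	return p.mode;
}
```

A genuine user-requested 'local' policy never has MPOL_F_LOCAL_TEMP set, so it is never flipped to 'prefer' by a rebind.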

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/uapi/linux/mempolicy.h |  1 +
 mm/mempolicy.c                 | 77 +++++++++++++++++++++++-------------------
 2 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 4832fd0..942844a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -63,6 +63,7 @@ enum {
 #define MPOL_F_LOCAL   (1 << 1)	/* preferred local allocation */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_LOCAL_TEMP  (1 << 5) /* a policy temporarily changed from 'prefer' to 'local' */
 
 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d97839d..6046196 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -337,6 +337,22 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 	pol->v.nodes = tmp;
 }
 
+static void mpol_rebind_local(struct mempolicy *pol,
+				const nodemask_t *nodes)
+{
+	if (unlikely(pol->flags & MPOL_F_STATIC_NODES)) {
+		int node = first_node(pol->w.user_nodemask);
+
+		BUG_ON(!(pol->flags & MPOL_F_LOCAL_TEMP));
+
+		if (node_isset(node, *nodes)) {
+			pol->v.preferred_node = node;
+			pol->mode = MPOL_PREFERRED;
+			pol->flags &= ~MPOL_F_LOCAL_TEMP;
+		}
+	}
+}
+
 static void mpol_rebind_preferred(struct mempolicy *pol,
 						const nodemask_t *nodes)
 {
@@ -347,13 +363,19 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
 
 		if (node_isset(node, *nodes)) {
 			pol->v.preferred_node = node;
-			pol->flags &= ~MPOL_F_LOCAL;
-		} else
-			pol->flags |= MPOL_F_LOCAL;
+		} else {
+			/*
+			 * If there is no valid node, change the mode to
+			 * MPOL_LOCAL, which will be restored back when
+			 * next rebind() sees a valid node.
+			 */
+			pol->mode = MPOL_LOCAL;
+			pol->flags |= MPOL_F_LOCAL_TEMP;
+		}
 	} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
 		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
 		pol->v.preferred_node = first_node(tmp);
-	} else if (!(pol->flags & MPOL_F_LOCAL)) {
+	} else {
 		pol->v.preferred_node = node_remap(pol->v.preferred_node,
 						   pol->w.cpuset_mems_allowed,
 						   *nodes);
@@ -372,7 +394,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
 {
 	if (!pol)
 		return;
-	if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
+	if (!mpol_store_user_nodemask(pol) &&
 	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
 		return;
 
@@ -425,7 +447,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	},
 	[MPOL_LOCAL] = {
 		.create = mpol_new_local,
-		.rebind = mpol_rebind_default,
+		.rebind = mpol_rebind_local,
 	},
 };
 
@@ -919,10 +941,12 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
+	case MPOL_LOCAL:
+		/* return empty node mask for local allocation */
+		break;
+
 	case MPOL_PREFERRED:
-		if (!(p->flags & MPOL_F_LOCAL))
-			node_set(p->v.preferred_node, *nodes);
-		/* else return empty node mask for local allocation */
+		node_set(p->v.preferred_node, *nodes);
 		break;
 	default:
 		BUG();
@@ -1899,9 +1923,9 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 /* Return the node id preferred by the given mempolicy, or the given id */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
-	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
+	if (policy->mode == MPOL_PREFERRED) {
 		nd = policy->v.preferred_node;
-	else {
+	} else {
 		/*
 		 * __GFP_THISNODE shouldn't even be used with the bind policy
 		 * because we might easily break the expectation to stay on the
@@ -1938,14 +1962,11 @@ unsigned int mempolicy_slab_node(void)
 		return node;
 
 	policy = current->mempolicy;
-	if (!policy || policy->flags & MPOL_F_LOCAL)
+	if (!policy)
 		return node;
 
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
-		/*
-		 * handled MPOL_F_LOCAL above
-		 */
 		return policy->v.preferred_node;
 
 	case MPOL_INTERLEAVE:
@@ -2079,16 +2100,13 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
-		if (mempolicy->flags & MPOL_F_LOCAL)
-			nid = numa_node_id();
-		else
-			nid = mempolicy->v.preferred_node;
+		nid = mempolicy->v.preferred_node;
 		init_nodemask_of_node(mask, nid);
 		break;
 
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-		*mask =  mempolicy->v.nodes;
+		*mask = mempolicy->v.nodes;
 		break;
 
 	case MPOL_LOCAL:
@@ -2200,7 +2218,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * If the policy is interleave, or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+		if (pol->mode == MPOL_PREFERRED)
 			hpage_node = pol->v.preferred_node;
 
 		nmask = policy_nodemask(gfp, pol);
@@ -2336,9 +2354,6 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 		return !!nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
-		/* a's ->flags is the same as b's */
-		if (a->flags & MPOL_F_LOCAL)
-			return true;
 		return a->v.preferred_node == b->v.preferred_node;
 	case MPOL_LOCAL:
 		return true;
@@ -2479,10 +2494,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
-		if (pol->flags & MPOL_F_LOCAL)
-			polnid = numa_node_id();
-		else
-			polnid = pol->v.preferred_node;
+		polnid = pol->v.preferred_node;
 		break;
 
 	case MPOL_LOCAL:
@@ -2853,9 +2865,6 @@ void numa_default_policy(void)
  * Parse and format mempolicy from/to strings
  */
 
-/*
- * "local" is implemented internally by MPOL_PREFERRED with MPOL_F_LOCAL flag.
- */
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
@@ -3022,12 +3031,10 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 
 	switch (mode) {
 	case MPOL_DEFAULT:
+	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
-		if (flags & MPOL_F_LOCAL)
-			mode = MPOL_LOCAL;
-		else
-			node_set(pol->v.preferred_node, nodes);
+		node_set(pol->v.preferred_node, nodes);
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-- 
2.7.4



* Re: [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom
  2021-05-26  5:01 ` [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom Feng Tang
@ 2021-05-27  7:30   ` Michal Hocko
  2021-05-27 13:05     ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-27  7:30 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Wed 26-05-21 13:01:39, Feng Tang wrote:
> mempolicy_nodemask_intersects() is used in oom case to check if a
> task may have memory allocated on some memory nodes.
> 
> Currently, the nodes_intersects() is run for both 'bind' and 'interleave'
> policies. But they are different regarding memory allocation, the nodemask
> is a forced requirement for 'bind', while just a hint for 'interleave'.
> Like in alloc_pages_vma():
> 
> 	nmask = policy_nodemask(gfp, pol);
>         preferred_nid = policy_node(gfp, pol, node);
>         page = __alloc_pages(gfp, order, preferred_nid, nmask);
> 
> in plicy_nodemask(), only 'bind' policy may return its desired nodemask,
> while others return NULL.  And this 'NULL' enables the 'interleave' policy
> can get memory from other nodes than its nodemask.
> 
> So skip the nodemask intersect check for 'interleave' policy.

The changelog is not really clear on the actual effect of the
patch and the above reference to alloc_pages_vma looks misleading to me
because that path is never called for interleaved policy.

This is very likely my fault because I was rather vague. The existing
code in its current form is confusing but it _works_ properly. The
problem is that it sounds like a general helper and in that regards
the function is correct for the interleaved policy and your proposed
preferred-many. But its only existing caller wants a different semantic.

Until now this was not a real problem even for OOM context because
alloc_page_interleave is always used for the interleaving policy
and that one doesn't use any node mask so the code is not really
exercised. With your MPOL_PREFERRED this would no longer be the case.

Your patch makes the code more robust for the oom context but it can
confuse other users who really want to do an intersect logic. So I think
it would really be best to rename the function and make it oom specific.
E.g. mempolicy_in_oom_domain(tsk, mask); this would make it clear that
this is not a general purpose function.

The changelog should be clear that this is just a code cleanup rather
than a fix.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 2/4] mm/mempolicy: unify the preprocessing for mbind and set_mempolicy
  2021-05-26  5:01 ` [PATCH v1 2/4] mm/mempolicy: unify the preprocessing for mbind and set_mempolicy Feng Tang
@ 2021-05-27  7:39   ` Michal Hocko
  2021-05-27 12:31     ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-27  7:39 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Wed 26-05-21 13:01:40, Feng Tang wrote:
> Currently the kernel_mbind() and kernel_set_mempolicy() do almost
> the same operation for parameter sanity check and preprocessing.
> 
> Add a helper function to unify the code to reduce the redundancy,
> and make it easier for changing the pre-processing code in future.
> 
> [thanks to David Rientjes for suggesting using helper function
> instead of macro]

I appreciate removing the code duplication but I am not really convinced
this is an improvement. You are conflating two things. One is the mpol
flags checking and node mask copying. While abstracting the first one
makes sense to me the later is already a single line of code that makes
your helper unnecessarily complex. So I would go with sanitize_mpol_flags
and put a flags handling there and leave get_nodes alone.
 
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/mempolicy.c | 43 ++++++++++++++++++++++++-------------------
>  1 file changed, 24 insertions(+), 19 deletions(-)

Funny how removing code duplication adds more code than it removes ;)

> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 1964cca..2830bb8 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1460,6 +1460,20 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
>  	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
>  }
>  
> +static inline int mpol_pre_process(int *mode, const unsigned long __user *nmask, unsigned long maxnode, nodemask_t *nodes, unsigned short *flags)
> +{
> +	int ret;
> +
> +	*flags = *mode & MPOL_MODE_FLAGS;
> +	*mode &= ~MPOL_MODE_FLAGS;
> +	if ((unsigned int)(*mode) >= MPOL_MAX)
> +		return -EINVAL;
> +	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
> +		return -EINVAL;
> +	ret = get_nodes(nodes, nmask, maxnode);
> +	return ret;
> +}
> +
>  static long kernel_mbind(unsigned long start, unsigned long len,
>  			 unsigned long mode, const unsigned long __user *nmask,
>  			 unsigned long maxnode, unsigned int flags)
> @@ -1467,19 +1481,14 @@ static long kernel_mbind(unsigned long start, unsigned long len,
>  	nodemask_t nodes;
>  	int err;
>  	unsigned short mode_flags;
> +	int lmode = mode;
>  
> -	start = untagged_addr(start);
> -	mode_flags = mode & MPOL_MODE_FLAGS;
> -	mode &= ~MPOL_MODE_FLAGS;
> -	if (mode >= MPOL_MAX)
> -		return -EINVAL;
> -	if ((mode_flags & MPOL_F_STATIC_NODES) &&
> -	    (mode_flags & MPOL_F_RELATIVE_NODES))
> -		return -EINVAL;
> -	err = get_nodes(&nodes, nmask, maxnode);
> +	err = mpol_pre_process(&lmode, nmask, maxnode, &nodes, &mode_flags);
>  	if (err)
>  		return err;
> -	return do_mbind(start, len, mode, mode_flags, &nodes, flags);
> +
> +	start = untagged_addr(start);
> +	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
>  }
>  
>  SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
> @@ -1495,18 +1504,14 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
>  {
>  	int err;
>  	nodemask_t nodes;
> -	unsigned short flags;
> +	unsigned short mode_flags;
> +	int lmode = mode;
>  
> -	flags = mode & MPOL_MODE_FLAGS;
> -	mode &= ~MPOL_MODE_FLAGS;
> -	if ((unsigned int)mode >= MPOL_MAX)
> -		return -EINVAL;
> -	if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES))
> -		return -EINVAL;
> -	err = get_nodes(&nodes, nmask, maxnode);
> +	err = mpol_pre_process(&lmode, nmask, maxnode, &nodes, &mode_flags);
>  	if (err)
>  		return err;
> -	return do_set_mempolicy(mode, flags, &nodes);
> +
> +	return do_set_mempolicy(lmode, mode_flags, &nodes);
>  }
>  
>  SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
  2021-05-26  5:01 ` [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy Feng Tang
@ 2021-05-27  8:12   ` Michal Hocko
  2021-05-27 12:06     ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-27  8:12 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Wed 26-05-21 13:01:41, Feng Tang wrote:
> MPOL_LOCAL policy has been setup as a real policy, but it is still
> handled like a faked POL_PREFERRED policy with one internal
> MPOL_F_LOCAL flag bit set, and there are many places having to
> judge the real 'prefer' or the 'local' policy, which are quite
> confusing.
> 
> In current code, there are four cases that MPOL_LOCAL are used:
> * user specifies 'local' policy
> * user specifies 'prefer' policy, but with empty nodemask
> * system 'default' policy is used
> * 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES
>   flag set, and when it is 'rebind' to a nodemask which doesn't
>   contains the 'preferred' node, it will add the MPOL_F_LOCAL bit
>   and performs as 'local' policy. In future if it is 'rebind' again
>   with valid nodemask, the policy will be restored back to 'prefer'
> 
> So for the first three cases, we make 'local' a real policy
> instead of a fake 'prefer' one, this will reduce confusion for
> reading code.
> 
> And next optional patch will kill the 'MPOL_F_LOCAL' bit.

I do like this approach. An additional policy should be much easier to
grasp than a special casing. This code is quite tricky so another pair
of eyes would be definitely good for the review.

> Signed-off-by: Feng Tang <feng.tang@intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Just few nits.

>  static int migrate_page_add(struct page *page, struct list_head *pagelist,
> @@ -1965,6 +1965,8 @@ unsigned int mempolicy_slab_node(void)
>  							&policy->v.nodes);
>  		return z->zone ? zone_to_nid(z->zone) : node;
>  	}
> +	case MPOL_LOCAL:
> +		return node;

Any reason you haven't removed MPOL_F_LOCAL in this and following
functions? It would make it much more easier to review this patch if
there was no actual use of the flag in the code after this patch.

>  
>  	default:
>  		BUG();
> @@ -2089,6 +2091,11 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  		*mask =  mempolicy->v.nodes;
>  		break;
>  
> +	case MPOL_LOCAL:
> +		nid = numa_node_id();
> +		init_nodemask_of_node(mask, nid);
> +		break;
> +
>  	default:
>  		BUG();
>  	}
> @@ -2333,6 +2340,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
>  		if (a->flags & MPOL_F_LOCAL)
>  			return true;
>  		return a->v.preferred_node == b->v.preferred_node;
> +	case MPOL_LOCAL:
> +		return true;
>  	default:
>  		BUG();
>  		return false;
> @@ -2476,6 +2485,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  			polnid = pol->v.preferred_node;
>  		break;
>  
> +	case MPOL_LOCAL:
> +		polnid = numa_node_id();
> +		break;
> +
>  	case MPOL_BIND:
>  		/* Optimize placement among multiple nodes via NUMA balancing */
>  		if (pol->flags & MPOL_F_MORON) {
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-26  5:01 ` [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit Feng Tang
@ 2021-05-27  8:20   ` Michal Hocko
  2021-05-27 12:10     ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-27  8:20 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Wed 26-05-21 13:01:42, Feng Tang wrote:
> Now the only remaining case of a real 'local' policy faked by
> 'prefer' policy plus MPOL_F_LOCAL bit is:
> 
> A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> to a nodemask which doesn't contains the 'preferred' node, then it
> will handle allocation with 'local' policy.
> 
> Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> MPOL_F_LOCAL bit, which could simplify the code much.

As I've pointed out in the reply to the previous patch. It would have
been much better if most of the MPOL_F_LOCAL usage was gone by this
patch.

I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
hack back in after you have painstakingly removed it. So this looks like
a step backwards to me. I also do not understand why we need the
rebind callback for the local policy at all. There is no nodemask for
local, so what is going on here?

[...]
> +static void mpol_rebind_local(struct mempolicy *pol,
> +				const nodemask_t *nodes)
> +{
> +	if (unlikely(pol->flags & MPOL_F_STATIC_NODES)) {
> +		int node = first_node(pol->w.user_nodemask);
> +
> +		BUG_ON(!(pol->flags & MPOL_F_LOCAL_TEMP));
> +
> +		if (node_isset(node, *nodes)) {
> +			pol->v.preferred_node = node;
> +			pol->mode = MPOL_PREFERRED;
> +			pol->flags &= ~MPOL_F_LOCAL_TEMP;
> +		}
> +	}
> +}
> +

I have to confess I've got lost here. Could you explain why do you need
all this handling for a local policy?

>  static void mpol_rebind_preferred(struct mempolicy *pol,
>  						const nodemask_t *nodes)
>  {
> @@ -347,13 +363,19 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
>  
>  		if (node_isset(node, *nodes)) {
>  			pol->v.preferred_node = node;
> -			pol->flags &= ~MPOL_F_LOCAL;
> -		} else
> -			pol->flags |= MPOL_F_LOCAL;
> +		} else {
> +			/*
> +			 * If there is no valid node, change the mode to
> +			 * MPOL_LOCAL, which will be restored back when
> +			 * next rebind() sees a valid node.
> +			 */
> +			pol->mode = MPOL_LOCAL;
> +			pol->flags |= MPOL_F_LOCAL_TEMP;
> +		}
>  	} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
>  		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
>  		pol->v.preferred_node = first_node(tmp);
> -	} else if (!(pol->flags & MPOL_F_LOCAL)) {
> +	} else {
>  		pol->v.preferred_node = node_remap(pol->v.preferred_node,
>  						   pol->w.cpuset_mems_allowed,
>  						   *nodes);
> @@ -372,7 +394,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
>  {
>  	if (!pol)
>  		return;
> -	if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
> +	if (!mpol_store_user_nodemask(pol) &&
>  	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
>  		return;
>  
> @@ -425,7 +447,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  	},
>  	[MPOL_LOCAL] = {
>  		.create = mpol_new_local,
> -		.rebind = mpol_rebind_default,
> +		.rebind = mpol_rebind_local,
>  	},
>  };
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
  2021-05-27  8:12   ` Michal Hocko
@ 2021-05-27 12:06     ` Feng Tang
  2021-05-27 12:16       ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-27 12:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

Hi Michal,

Many thanks for the reviews!

On Thu, May 27, 2021 at 10:12:15AM +0200, Michal Hocko wrote:
> On Wed 26-05-21 13:01:41, Feng Tang wrote:
> > MPOL_LOCAL policy has been setup as a real policy, but it is still
> > handled like a faked POL_PREFERRED policy with one internal
> > MPOL_F_LOCAL flag bit set, and there are many places having to
> > judge the real 'prefer' or the 'local' policy, which are quite
> > confusing.
> > 
> > In current code, there are four cases that MPOL_LOCAL are used:
> > * user specifies 'local' policy
> > * user specifies 'prefer' policy, but with empty nodemask
> > * system 'default' policy is used
> > * 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES
> >   flag set, and when it is 'rebind' to a nodemask which doesn't
> >   contains the 'preferred' node, it will add the MPOL_F_LOCAL bit
> >   and performs as 'local' policy. In future if it is 'rebind' again
> >   with valid nodemask, the policy will be restored back to 'prefer'
> > 
> > So for the first three cases, we make 'local' a real policy
> > instead of a fake 'prefer' one, this will reduce confusion for
> > reading code.
> > 
> > And next optional patch will kill the 'MPOL_F_LOCAL' bit.
> 
> I do like this approach. An additional policy should be much easier to
> grasp than a special casing. This code is quite tricky so another pair
> of eyes would be definitely good for the review.
> 
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> Just few nits.
> 
> >  static int migrate_page_add(struct page *page, struct list_head *pagelist,
> > @@ -1965,6 +1965,8 @@ unsigned int mempolicy_slab_node(void)
> >  							&policy->v.nodes);
> >  		return z->zone ? zone_to_nid(z->zone) : node;
> >  	}
> > +	case MPOL_LOCAL:
> > +		return node;
> 
> Any reason you haven't removed MPOL_F_LOCAL in this and following
> functions? It would make it much more easier to review this patch if
> there was no actual use of the flag in the code after this patch.

As described in the commit log, there are 4 cases using 'prefer' +
MPOL_F_LOCAL to represent the 'local' policy.

I'm confident in this patch, which handles cases 1/2/3, but I'm not
sure if the solution (patch 4/4) for case 4 is the right method, so
I separated them into 3/4 and 4/4.

Thanks,
Feng


> >  
> >  	default:
> >  		BUG();
> > @@ -2089,6 +2091,11 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> >  		*mask =  mempolicy->v.nodes;
> >  		break;
> >  
> > +	case MPOL_LOCAL:
> > +		nid = numa_node_id();
> > +		init_nodemask_of_node(mask, nid);
> > +		break;
> > +
> >  	default:
> >  		BUG();
> >  	}
> > @@ -2333,6 +2340,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
> >  		if (a->flags & MPOL_F_LOCAL)
> >  			return true;
> >  		return a->v.preferred_node == b->v.preferred_node;
> > +	case MPOL_LOCAL:
> > +		return true;
> >  	default:
> >  		BUG();
> >  		return false;
> > @@ -2476,6 +2485,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> >  			polnid = pol->v.preferred_node;
> >  		break;
> >  
> > +	case MPOL_LOCAL:
> > +		polnid = numa_node_id();
> > +		break;
> > +
> >  	case MPOL_BIND:
> >  		/* Optimize placement among multiple nodes via NUMA balancing */
> >  		if (pol->flags & MPOL_F_MORON) {
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-27  8:20   ` Michal Hocko
@ 2021-05-27 12:10     ` Feng Tang
  2021-05-27 12:26       ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-27 12:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu, May 27, 2021 at 10:20:08AM +0200, Michal Hocko wrote:
> On Wed 26-05-21 13:01:42, Feng Tang wrote:
> > Now the only remaining case of a real 'local' policy faked by
> > 'prefer' policy plus MPOL_F_LOCAL bit is:
> > 
> > A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> > to a nodemask which doesn't contains the 'preferred' node, then it
> > will handle allocation with 'local' policy.
> > 
> > Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> > MPOL_F_LOCAL bit, which could simplify the code much.
> 
> As I've pointed out in the reply to the previous patch. It would have
> been much better if most of the MPOL_F_LOCAL usage was gone by this
> patch.
> 
> I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
> hack back in after you have painstakingly removed it. So this looks like
> a step backwards to me. I also do not understand why do we need the
> rebind callback for local policy at all. There is no node mask for local
> so what is going on here?

This is the special case 4 for the 'prefer' policy with the
MPOL_F_STATIC_NODES flag set. Say it prefers node 1: when it is later
'rebind' to a new nodemask of nodes 2-3, the current code will add the
MPOL_F_LOCAL bit and it actually performs as the 'local' policy. If it
is later 'rebind' again with a nodemask of nodes 1-2, it will be
restored back to the 'prefer' policy with preferred node 1.

This patch tries to address this special case. I have struggled but
couldn't think of a better way. Any suggestions? Thanks!

- Feng

> [...]
> > +static void mpol_rebind_local(struct mempolicy *pol,
> > +				const nodemask_t *nodes)
> > +{
> > +	if (unlikely(pol->flags & MPOL_F_STATIC_NODES)) {
> > +		int node = first_node(pol->w.user_nodemask);
> > +
> > +		BUG_ON(!(pol->flags & MPOL_F_LOCAL_TEMP));
> > +
> > +		if (node_isset(node, *nodes)) {
> > +			pol->v.preferred_node = node;
> > +			pol->mode = MPOL_PREFERRED;
> > +			pol->flags &= ~MPOL_F_LOCAL_TEMP;
> > +		}
> > +	}
> > +}
> > +
> 
> I have to confess I've got lost here. Could you explain why do you need
> all this handling for a local policy?
> 
> >  static void mpol_rebind_preferred(struct mempolicy *pol,
> >  						const nodemask_t *nodes)
> >  {
> > @@ -347,13 +363,19 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
> >  
> >  		if (node_isset(node, *nodes)) {
> >  			pol->v.preferred_node = node;
> > -			pol->flags &= ~MPOL_F_LOCAL;
> > -		} else
> > -			pol->flags |= MPOL_F_LOCAL;
> > +		} else {
> > +			/*
> > +			 * If there is no valid node, change the mode to
> > +			 * MPOL_LOCAL, which will be restored back when
> > +			 * next rebind() sees a valid node.
> > +			 */
> > +			pol->mode = MPOL_LOCAL;
> > +			pol->flags |= MPOL_F_LOCAL_TEMP;
> > +		}
> >  	} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
> >  		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
> >  		pol->v.preferred_node = first_node(tmp);
> > -	} else if (!(pol->flags & MPOL_F_LOCAL)) {
> > +	} else {
> >  		pol->v.preferred_node = node_remap(pol->v.preferred_node,
> >  						   pol->w.cpuset_mems_allowed,
> >  						   *nodes);
> > @@ -372,7 +394,7 @@ static void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
> >  {
> >  	if (!pol)
> >  		return;
> > -	if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
> > +	if (!mpol_store_user_nodemask(pol) &&
> >  	    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
> >  		return;
> >  
> > @@ -425,7 +447,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> >  	},
> >  	[MPOL_LOCAL] = {
> >  		.create = mpol_new_local,
> > -		.rebind = mpol_rebind_default,
> > +		.rebind = mpol_rebind_local,
> >  	},
> >  };
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy
  2021-05-27 12:06     ` Feng Tang
@ 2021-05-27 12:16       ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2021-05-27 12:16 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu 27-05-21 20:06:42, Feng Tang wrote:
> Hi Michal,
> 
> Many thanks for the reivews!
> 
> On Thu, May 27, 2021 at 10:12:15AM +0200, Michal Hocko wrote:
> > On Wed 26-05-21 13:01:41, Feng Tang wrote:
> > > MPOL_LOCAL policy has been setup as a real policy, but it is still
> > > handled like a faked POL_PREFERRED policy with one internal
> > > MPOL_F_LOCAL flag bit set, and there are many places having to
> > > judge the real 'prefer' or the 'local' policy, which are quite
> > > confusing.
> > > 
> > > In current code, there are four cases that MPOL_LOCAL are used:
> > > * user specifies 'local' policy
> > > * user specifies 'prefer' policy, but with empty nodemask
> > > * system 'default' policy is used
> > > * 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES
> > >   flag set, and when it is 'rebind' to a nodemask which doesn't
> > >   contains the 'preferred' node, it will add the MPOL_F_LOCAL bit
> > >   and performs as 'local' policy. In future if it is 'rebind' again
> > >   with valid nodemask, the policy will be restored back to 'prefer'
> > > 
> > > So for the first three cases, we make 'local' a real policy
> > > instead of a fake 'prefer' one, this will reduce confusion for
> > > reading code.
> > > 
> > > And next optional patch will kill the 'MPOL_F_LOCAL' bit.
> > 
> > I do like this approach. An additional policy should be much easier to
> > grasp than a special casing. This code is quite tricky so another pair
> > of eyes would be definitely good for the review.
> > 
> > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > 
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Thanks!
> 
> > Just few nits.
> > 
> > >  static int migrate_page_add(struct page *page, struct list_head *pagelist,
> > > @@ -1965,6 +1965,8 @@ unsigned int mempolicy_slab_node(void)
> > >  							&policy->v.nodes);
> > >  		return z->zone ? zone_to_nid(z->zone) : node;
> > >  	}
> > > +	case MPOL_LOCAL:
> > > +		return node;
> > 
> > Any reason you haven't removed MPOL_F_LOCAL in this and following
> > functions? It would make it much more easier to review this patch if
> > there was no actual use of the flag in the code after this patch.
> 
> As in the commit log, there are 4 cases using 'prefer' + MPOL_F_LOCAL 
> to represent 'local' policy. 
> 
> I'm confident in this patch which handle the case 1/2/3, while not 
> sure if the solution (patch 4/4) for case 4 is the right method. So
> I separte them into 3/4 and 4/4

Please don't; handle the above and those below in a single patch.
 
> Thanks,
> Feng
> 
> 
> > >  
> > >  	default:
> > >  		BUG();
> > > @@ -2089,6 +2091,11 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
> > >  		*mask =  mempolicy->v.nodes;
> > >  		break;
> > >  
> > > +	case MPOL_LOCAL:
> > > +		nid = numa_node_id();
> > > +		init_nodemask_of_node(mask, nid);
> > > +		break;
> > > +
> > >  	default:
> > >  		BUG();
> > >  	}
> > > @@ -2333,6 +2340,8 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
> > >  		if (a->flags & MPOL_F_LOCAL)
> > >  			return true;
> > >  		return a->v.preferred_node == b->v.preferred_node;
> > > +	case MPOL_LOCAL:
> > > +		return true;
> > >  	default:
> > >  		BUG();
> > >  		return false;
> > > @@ -2476,6 +2485,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> > >  			polnid = pol->v.preferred_node;
> > >  		break;
> > >  
> > > +	case MPOL_LOCAL:
> > > +		polnid = numa_node_id();
> > > +		break;
> > > +
> > >  	case MPOL_BIND:
> > >  		/* Optimize placement among multiple nodes via NUMA balancing */
> > >  		if (pol->flags & MPOL_F_MORON) {
> > -- 
> > Michal Hocko
> > SUSE Labs

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-27 12:10     ` Feng Tang
@ 2021-05-27 12:26       ` Michal Hocko
  2021-05-27 13:34         ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-27 12:26 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu 27-05-21 20:10:41, Feng Tang wrote:
> On Thu, May 27, 2021 at 10:20:08AM +0200, Michal Hocko wrote:
> > On Wed 26-05-21 13:01:42, Feng Tang wrote:
> > > Now the only remaining case of a real 'local' policy faked by
> > > 'prefer' policy plus MPOL_F_LOCAL bit is:
> > > 
> > > A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> > > to a nodemask which doesn't contains the 'preferred' node, then it
> > > will handle allocation with 'local' policy.
> > > 
> > > Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> > > MPOL_F_LOCAL bit, which could simplify the code much.
> > 
> > As I've pointed out in the reply to the previous patch. It would have
> > been much better if most of the MPOL_F_LOCAL usage was gone by this
> > patch.
> > 
> > I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
> > hack back in after you have painstakingly removed it. So this looks like
> > a step backwards to me. I also do not understand why do we need the
> > rebind callback for local policy at all. There is no node mask for local
> > so what is going on here?
> 
> This is the special case 4 for 'perfer' policy with MPOL_F_STATIC_NODES
> flag set, say it prefer node 1, when it is later 'refind' to a new
> nodemask node 2-3, according to current code it will be add the
> MPOL_F_LOCAL bit and performs 'local' policy acctually. And in future
> it is 'rebind' again with a nodemask 1-2, it will be restored back
> to 'prefer' policy with preferred node 1.

Honestly I still do not follow the actual problem. A preferred node is a
_hint_. If you rebind the task to a different cpuset then why should we
actually care? The allocator will fallback to the closest node according
to the distance metric. Maybe the original code was trying to handle
that in some way but I really do fail to understand that code and I
strongly suspect it is more likely overengineered than backed by a
real use case. I might be wrong here but then this is an excellent
opportunity to clarify all those subtleties.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 2/4] mm/mempolicy: unify the preprocessing for mbind and set_mempolicy
  2021-05-27  7:39   ` Michal Hocko
@ 2021-05-27 12:31     ` Feng Tang
  0 siblings, 0 replies; 24+ messages in thread
From: Feng Tang @ 2021-05-27 12:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu, May 27, 2021 at 09:39:49AM +0200, Michal Hocko wrote:
> On Wed 26-05-21 13:01:40, Feng Tang wrote:
> > Currently the kernel_mbind() and kernel_set_mempolicy() do almost
> > the same operation for parameter sanity check and preprocessing.
> > 
> > Add a helper function to unify the code to reduce the redundancy,
> > and make it easier for changing the pre-processing code in future.
> > 
> > [thanks to David Rientjes for suggesting using helper function
> > instead of macro]
> 
> I appreciate removing the code duplication but I am not really convinced
> this is an improvement. You are conflating two things. One is the mpol
> flags checking and node mask copying. While abstracting the first one
> makes sense to me the later is already a single line of code that makes
> your helper unnecessarily complex. So I would go with sanitize_mpol_flags
> and put a flags handling there and leave get_nodes alone.
>  
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  mm/mempolicy.c | 43 ++++++++++++++++++++++++-------------------
> >  1 file changed, 24 insertions(+), 19 deletions(-)
> 
> Funny how removing code duplication adds more code than it removes ;)

Yes.

And the last version, which used a macro to unify the code:
https://lore.kernel.org/lkml/1621499404-67756-3-git-send-email-feng.tang@intel.com/
did save some lines :)

 mm/mempolicy.c | 46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

Thanks,
Feng

> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 1964cca..2830bb8 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1460,6 +1460,20 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
> >  	return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0;
> >  }
> >  
> > +static inline int mpol_pre_process(int *mode, const unsigned long __user *nmask, unsigned long maxnode, nodemask_t *nodes, unsigned short *flags)
> > +{
> > +	int ret;
> > +
> > +	*flags = *mode & MPOL_MODE_FLAGS;
> > +	*mode &= ~MPOL_MODE_FLAGS;
> > +	if ((unsigned int)(*mode) >= MPOL_MAX)
> > +		return -EINVAL;
> > +	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
> > +		return -EINVAL;
> > +	ret = get_nodes(nodes, nmask, maxnode);
> > +	return ret;
> > +}
> > +
> >  static long kernel_mbind(unsigned long start, unsigned long len,
> >  			 unsigned long mode, const unsigned long __user *nmask,
> >  			 unsigned long maxnode, unsigned int flags)
> > @@ -1467,19 +1481,14 @@ static long kernel_mbind(unsigned long start, unsigned long len,
> >  	nodemask_t nodes;
> >  	int err;
> >  	unsigned short mode_flags;
> > +	int lmode = mode;
> >  
> > -	start = untagged_addr(start);
> > -	mode_flags = mode & MPOL_MODE_FLAGS;
> > -	mode &= ~MPOL_MODE_FLAGS;
> > -	if (mode >= MPOL_MAX)
> > -		return -EINVAL;
> > -	if ((mode_flags & MPOL_F_STATIC_NODES) &&
> > -	    (mode_flags & MPOL_F_RELATIVE_NODES))
> > -		return -EINVAL;
> > -	err = get_nodes(&nodes, nmask, maxnode);
> > +	err = mpol_pre_process(&lmode, nmask, maxnode, &nodes, &mode_flags);
> >  	if (err)
> >  		return err;
> > -	return do_mbind(start, len, mode, mode_flags, &nodes, flags);
> > +
> > +	start = untagged_addr(start);
> > +	return do_mbind(start, len, lmode, mode_flags, &nodes, flags);
> >  }
> >  
> >  SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len,
> > @@ -1495,18 +1504,14 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
> >  {
> >  	int err;
> >  	nodemask_t nodes;
> > -	unsigned short flags;
> > +	unsigned short mode_flags;
> > +	int lmode = mode;
> >  
> > -	flags = mode & MPOL_MODE_FLAGS;
> > -	mode &= ~MPOL_MODE_FLAGS;
> > -	if ((unsigned int)mode >= MPOL_MAX)
> > -		return -EINVAL;
> > -	if ((flags & MPOL_F_STATIC_NODES) && (flags & MPOL_F_RELATIVE_NODES))
> > -		return -EINVAL;
> > -	err = get_nodes(&nodes, nmask, maxnode);
> > +	err = mpol_pre_process(&lmode, nmask, maxnode, &nodes, &mode_flags);
> >  	if (err)
> >  		return err;
> > -	return do_set_mempolicy(mode, flags, &nodes);
> > +
> > +	return do_set_mempolicy(lmode, mode_flags, &nodes);
> >  }
> >  
> >  SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom
  2021-05-27  7:30   ` Michal Hocko
@ 2021-05-27 13:05     ` Feng Tang
  2021-05-27 13:15       ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-27 13:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu, May 27, 2021 at 09:30:00AM +0200, Michal Hocko wrote:
> On Wed 26-05-21 13:01:39, Feng Tang wrote:
> > mempolicy_nodemask_intersects() is used in oom case to check if a
> > task may have memory allocated on some memory nodes.
> > 
> > Currently, the nodes_intersects() is run for both 'bind' and 'interleave'
> > policies. But they are different regarding memory allocation, the nodemask
> > is a forced requirement for 'bind', while just a hint for 'interleave'.
> > Like in alloc_pages_vma():
> > 
> > 	nmask = policy_nodemask(gfp, pol);
> >         preferred_nid = policy_node(gfp, pol, node);
> >         page = __alloc_pages(gfp, order, preferred_nid, nmask);
> > 
> > in policy_nodemask(), only 'bind' policy may return its desired nodemask,
> > while others return NULL.  And this 'NULL' enables the 'interleave' policy
> > can get memory from other nodes than its nodemask.
> > 
> > So skip the nodemask intersect check for 'interleave' policy.
> 
> The changelog is not really clear on the actual effect of the
> patch and the above reference to alloc_pages_vma looks misleading to me
> because that path is never called for interleaved policy.

You are right, thanks for pointing it out.

Only the 'bind' policy calls policy_nodemask() and gets its preset
nodemask, while for 'interleave', alloc_page_interleave() calls
__alloc_pages() with a NULL nodemask, so the conclusion is the same:
the 'bind' policy can only get memory from its preset nodemask, while
'interleave' can get memory from all nodes.

> This is very likely my fault because I was rather vague. The existing
> code in its current form is confusing but it _works_ properly. The
> problem is that it sounds like a general helper and in that regards
> the function is correct for the interleaved policy and your proposed
> preferred-many. But its only existing caller wants a different semantic.
> 
> Until now this was not a real problem even for OOM context because
> alloc_page_interleave is always used for the interleaving policy
> and that one doesn't use any node mask so the code is not really
> exercised. With your MPOL_PREFERRED this would no longer be the case.
 
Given that an 'interleave' task may have memory allocated from all
nodes, shouldn't mempolicy_nodemask_intersects() return true for
'interleave'? Or am I still missing something?

> Your patch makes the code more robust for the oom context but it can
> confuse other users who really want to do an intersect logic. So I think
> it would really be best to rename the function and make it oom specific.
> E.g. mempolicy_in_oom_domain(tsk, mask) this would make it clear that
> this is not a general purpose function.

Ok, will rename like this.

Thanks,
Feng

> The changelog should be clear that this is just a code cleanup rather
> than fix.
> 
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom
  2021-05-27 13:05     ` Feng Tang
@ 2021-05-27 13:15       ` Michal Hocko
  2021-05-27 13:22         ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-27 13:15 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu 27-05-21 21:05:01, Feng Tang wrote:
> On Thu, May 27, 2021 at 09:30:00AM +0200, Michal Hocko wrote:
[...]
> > Until now this was not a real problem even for OOM context because
> > alloc_page_interleave is always used for the interleaving policy
> > and that one doesn't use any node mask so the code is not really
> > exercised. With your MPOL_PREFERRED this would no longer be the case.
>  
> Given the 'interleave' task may have memory allocated from all nodes,
> shouldn't the mempolicy_nodemask_intersects() return true for 'interleave'?
> or I'm still missing something?

Well, if you go with the renaming then it should be quite obvious that
any policies which are not a hard binding should return true. 

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom
  2021-05-27 13:15       ` Michal Hocko
@ 2021-05-27 13:22         ` Feng Tang
  0 siblings, 0 replies; 24+ messages in thread
From: Feng Tang @ 2021-05-27 13:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu, May 27, 2021 at 03:15:04PM +0200, Michal Hocko wrote:
> On Thu 27-05-21 21:05:01, Feng Tang wrote:
> > On Thu, May 27, 2021 at 09:30:00AM +0200, Michal Hocko wrote:
> [...]
> > > Until now this was not a real problem even for OOM context because
> > > alloc_page_interleave is always used for the interleaving policy
> > > and that one doesn't use any node mask so the code is not really
> > > exercised. With your MPOL_PREFERRED this would no longer be the case.
> >  
> > Given the 'interleave' task may have memory allocated from all nodes,
> > shouldn't the mempolicy_nodemask_intersects() return true for 'interleave'?
> > or I'm still missing something?
> 
> Well, if you go with the renaming then it should be quite obvious that
> any policies which are not a hard binding should return true. 

OK, will do the rename. Thanks for clarifying!

- Feng

> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-27 12:26       ` Michal Hocko
@ 2021-05-27 13:34         ` Feng Tang
  2021-05-27 15:34           ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-27 13:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu, May 27, 2021 at 02:26:24PM +0200, Michal Hocko wrote:
> On Thu 27-05-21 20:10:41, Feng Tang wrote:
> > On Thu, May 27, 2021 at 10:20:08AM +0200, Michal Hocko wrote:
> > > On Wed 26-05-21 13:01:42, Feng Tang wrote:
> > > > Now the only remaining case of a real 'local' policy faked by
> > > > 'prefer' policy plus MPOL_F_LOCAL bit is:
> > > > 
> > > > A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> > > > to a nodemask which doesn't contain the 'preferred' node, then it
> > > > will handle allocation with 'local' policy.
> > > > 
> > > > Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> > > > MPOL_F_LOCAL bit, which could simplify the code much.
> > > 
> > > As I've pointed out in the reply to the previous patch. It would have
> > > been much better if most of the MPOL_F_LOCAL usage was gone by this
> > > patch.
> > > 
> > > I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
> > > hack back in after you have painstakingly removed it. So this looks like
> > > a step backwards to me. I also do not understand why do we need the
> > > rebind callback for local policy at all. There is no node mask for local
> > > so what is going on here?
> > 
> > This is the special case 4 for the 'prefer' policy with the
> > MPOL_F_STATIC_NODES flag set: say it prefers node 1, and it is later
> > 'rebind' to a new nodemask of nodes 2-3. With the current code it will
> > get the MPOL_F_LOCAL bit added and actually perform the 'local' policy.
> > And if in the future it is 'rebind' again with a nodemask of nodes 1-2,
> > it will be restored back to the 'prefer' policy with preferred node 1.
> 
> Honestly I still do not follow the actual problem. 

I was confused too, and I don't know the original thinking behind it. This
case 4 was just inferred from reading the code.

> A preferred node is a
> _hint_. If you rebind the task to a different cpuset then why should we
> actually care? The allocator will fallback to the closest node according
> to the distance metric. Maybe the original code was trying to handle
> that in some way but I really do fail to understand that code and I
> strongly suspect it is more likely to be overengineered rather than backed
> by a real usecase. I might be wrong here but then this is an excellent
> opportunity to clarify all those subtleties.

From the code, the original special handling may be needed in 3 cases:
    get_policy_nodemask()
    policy_node()
    mempolicy_slab_node()
to not return the preset prefer_nid.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-27 13:34         ` Feng Tang
@ 2021-05-27 15:34           ` Michal Hocko
  2021-05-28  4:39             ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-27 15:34 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu 27-05-21 21:34:36, Feng Tang wrote:
> On Thu, May 27, 2021 at 02:26:24PM +0200, Michal Hocko wrote:
> > On Thu 27-05-21 20:10:41, Feng Tang wrote:
> > > On Thu, May 27, 2021 at 10:20:08AM +0200, Michal Hocko wrote:
> > > > On Wed 26-05-21 13:01:42, Feng Tang wrote:
> > > > > Now the only remaining case of a real 'local' policy faked by
> > > > > 'prefer' policy plus MPOL_F_LOCAL bit is:
> > > > > 
> > > > > A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> > > > > to a nodemask which doesn't contain the 'preferred' node, then it
> > > > > will handle allocation with 'local' policy.
> > > > > 
> > > > > Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> > > > > MPOL_F_LOCAL bit, which could simplify the code much.
> > > > 
> > > > As I've pointed out in the reply to the previous patch. It would have
> > > > been much better if most of the MPOL_F_LOCAL usage was gone by this
> > > > patch.
> > > > 
> > > > I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
> > > > hack back in after you have painstakingly removed it. So this looks like
> > > > a step backwards to me. I also do not understand why do we need the
> > > > rebind callback for local policy at all. There is no node mask for local
> > > > so what is going on here?
> > > 
> > > This is the special case 4 for the 'prefer' policy with the
> > > MPOL_F_STATIC_NODES flag set: say it prefers node 1, and it is later
> > > 'rebind' to a new nodemask of nodes 2-3. With the current code it will
> > > get the MPOL_F_LOCAL bit added and actually perform the 'local' policy.
> > > And if in the future it is 'rebind' again with a nodemask of nodes 1-2,
> > > it will be restored back to the 'prefer' policy with preferred node 1.
> > 
> > Honestly I still do not follow the actual problem. 
> 
> I was confused too, and don't know the original thought behind it. This
> case 4 was just imagined by reading the code.
> 
> > A preferred node is a
> > _hint_. If you rebind the task to a different cpuset then why should we
> > actually care? The allocator will fallback to the closest node according
> > to the distance metric. Maybe the original code was trying to handle
> > that in some way but I really do fail to understand that code and I
> > strongly suspect it is more likely to be overengineered rather than backed
> > by a real usecase. I might be wrong here but then this is an excellent
> > opportunity to clarify all those subtleties.
> 
> From the code, the original special handling may be needed in 3 cases:
>     get_policy_nodemask()
>     policy_node()
>     mempolicy_slab_node()
> to not return the preset prefer_nid.

I am sorry but I do not follow. What is actually wrong if the preferred
node is outside of the cpuset nodemask?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-27 15:34           ` Michal Hocko
@ 2021-05-28  4:39             ` Feng Tang
  2021-05-31  7:00               ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-28  4:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Thu, May 27, 2021 at 05:34:56PM +0200, Michal Hocko wrote:
> On Thu 27-05-21 21:34:36, Feng Tang wrote:
> > On Thu, May 27, 2021 at 02:26:24PM +0200, Michal Hocko wrote:
> > > On Thu 27-05-21 20:10:41, Feng Tang wrote:
> > > > On Thu, May 27, 2021 at 10:20:08AM +0200, Michal Hocko wrote:
> > > > > On Wed 26-05-21 13:01:42, Feng Tang wrote:
> > > > > > Now the only remaining case of a real 'local' policy faked by
> > > > > > 'prefer' policy plus MPOL_F_LOCAL bit is:
> > > > > > 
> > > > > > A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> > > > > > to a nodemask which doesn't contain the 'preferred' node, then it
> > > > > > will handle allocation with 'local' policy.
> > > > > > 
> > > > > > Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> > > > > > MPOL_F_LOCAL bit, which could simplify the code much.
> > > > > 
> > > > > As I've pointed out in the reply to the previous patch. It would have
> > > > > been much better if most of the MPOL_F_LOCAL usage was gone by this
> > > > > patch.
> > > > > 
> > > > > I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
> > > > > hack back in after you have painstakingly removed it. So this looks like
> > > > > a step backwards to me. I also do not understand why do we need the
> > > > > rebind callback for local policy at all. There is no node mask for local
> > > > > so what is going on here?
> > > > 
> > > > This is the special case 4 for the 'prefer' policy with the
> > > > MPOL_F_STATIC_NODES flag set: say it prefers node 1, and it is later
> > > > 'rebind' to a new nodemask of nodes 2-3. With the current code it will
> > > > get the MPOL_F_LOCAL bit added and actually perform the 'local' policy.
> > > > And if in the future it is 'rebind' again with a nodemask of nodes 1-2,
> > > > it will be restored back to the 'prefer' policy with preferred node 1.
> > > 
> > > Honestly I still do not follow the actual problem. 
> > 
> > I was confused too, and don't know the original thought behind it. This
> > case 4 was just imagined by reading the code.
> > 
> > > A preferred node is a
> > > _hint_. If you rebind the task to a different cpuset then why should we
> > > actually care? The allocator will fallback to the closest node according
> > > to the distance metric. Maybe the original code was trying to handle
> > > that in some way but I really do fail to understand that code and I
> > > strongly suspect it is more likely to be overengineered rather than backed
> > > by a real usecase. I might be wrong here but then this is an excellent
> > > opportunity to clarify all those subtleties.
> > 
> > From the code, the original special handling may be needed in 3 cases:
> >     get_policy_nodemask()
> >     policy_node()
> >     mempolicy_slab_node()
> > to not return the preset prefer_nid.
> 
> I am sorry but I do not follow. What is actually wrong if the preferred
> node is outside of the cpuset nodemask?

Sorry, I didn't make it clear. With the current code logic, it will
perform as the 'local' policy, but its mode is kept as 'prefer', so the
code still needs these tricky bit checks when these APIs are called for
this policy. I agree with you that this ping-pong rebind() may be
over-engineering, so for this case can we just change the policy from
'prefer' to 'local' and drop the tricky bit manipulation? As 'prefer' is
just a hint, if the rebind misses the target node there is no need to
stick with the 'prefer' policy.

Thanks,
Feng

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-28  4:39             ` Feng Tang
@ 2021-05-31  7:00               ` Michal Hocko
  2021-05-31  7:32                 ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-31  7:00 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Fri 28-05-21 12:39:54, Feng Tang wrote:
> On Thu, May 27, 2021 at 05:34:56PM +0200, Michal Hocko wrote:
> > On Thu 27-05-21 21:34:36, Feng Tang wrote:
> > > On Thu, May 27, 2021 at 02:26:24PM +0200, Michal Hocko wrote:
> > > > On Thu 27-05-21 20:10:41, Feng Tang wrote:
> > > > > On Thu, May 27, 2021 at 10:20:08AM +0200, Michal Hocko wrote:
> > > > > > On Wed 26-05-21 13:01:42, Feng Tang wrote:
> > > > > > > Now the only remaining case of a real 'local' policy faked by
> > > > > > > 'prefer' policy plus MPOL_F_LOCAL bit is:
> > > > > > > 
> > > > > > > A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> > > > > > > to a nodemask which doesn't contain the 'preferred' node, then it
> > > > > > > will handle allocation with 'local' policy.
> > > > > > > 
> > > > > > > Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> > > > > > > MPOL_F_LOCAL bit, which could simplify the code much.
> > > > > > 
> > > > > > As I've pointed out in the reply to the previous patch. It would have
> > > > > > been much better if most of the MPOL_F_LOCAL usage was gone by this
> > > > > > patch.
> > > > > > 
> > > > > > I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
> > > > > > hack back in after you have painstakingly removed it. So this looks like
> > > > > > a step backwards to me. I also do not understand why do we need the
> > > > > > rebind callback for local policy at all. There is no node mask for local
> > > > > > so what is going on here?
> > > > > 
> > > > > This is the special case 4 for the 'prefer' policy with the
> > > > > MPOL_F_STATIC_NODES flag set: say it prefers node 1, and it is later
> > > > > 'rebind' to a new nodemask of nodes 2-3. With the current code it will
> > > > > get the MPOL_F_LOCAL bit added and actually perform the 'local' policy.
> > > > > And if in the future it is 'rebind' again with a nodemask of nodes 1-2,
> > > > > it will be restored back to the 'prefer' policy with preferred node 1.
> > > > 
> > > > Honestly I still do not follow the actual problem. 
> > > 
> > > I was confused too, and don't know the original thought behind it. This
> > > case 4 was just imagined by reading the code.
> > > 
> > > > A preferred node is a
> > > > _hint_. If you rebind the task to a different cpuset then why should we
> > > > actually care? The allocator will fallback to the closest node according
> > > > to the distance metric. Maybe the original code was trying to handle
> > > > that in some way but I really do fail to understand that code and I
> > > > strongly suspect it is more likely to be overengineered rather than backed
> > > > by a real usecase. I might be wrong here but then this is an excellent
> > > > opportunity to clarify all those subtleties.
> > > 
> > > From the code, the original special handling may be needed in 3 cases:
> > >     get_policy_nodemask()
> > >     policy_node()
> > >     mempolicy_slab_node()
> > > to not return the preset prefer_nid.
> > 
> > I am sorry but I do not follow. What is actually wrong if the preferred
> > node is outside of the cpuset nodemask?
> 
> Sorry, I didn't make it clear. With the current code logic, it will
> perform as the 'local' policy, but its mode is kept as 'prefer', so the
> code still needs these tricky bit checks when these APIs are called for
> this policy. I agree with you that this ping-pong rebind() may be
> over-engineering, so for this case can we just change the policy from
> 'prefer' to 'local' and drop the tricky bit manipulation? As 'prefer' is
> just a hint, if the rebind misses the target node there is no need to
> stick with the 'prefer' policy.

Again. I really do not understand why we should rebind or mark as local
anything here. Is this a documented/expected behavior? What if somebody
just changes the cpuset to include the preferred node again? Is it
expected to have a local preference now?

I can see you have posted a newer version which I haven't seen yet, but
this is really better to get resolved before building up more on top.
And let me be explicit. I do believe that rebinding the preferred policy
is just bogus and it should be dropped altogether on the grounds that a
preference is a mere hint from userspace for where to start the
allocation. Unless I am missing something, cpusets will always be
authoritative for the final placement. The preferred node just acts as a
starting point and it should really be preserved when cpusets change.
Otherwise we have very subtle behavioral corner cases.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-31  7:00               ` Michal Hocko
@ 2021-05-31  7:32                 ` Feng Tang
  2021-05-31  8:22                   ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Feng Tang @ 2021-05-31  7:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Mon, May 31, 2021 at 09:00:25AM +0200, Michal Hocko wrote:
> On Fri 28-05-21 12:39:54, Feng Tang wrote:
> > On Thu, May 27, 2021 at 05:34:56PM +0200, Michal Hocko wrote:
> > > On Thu 27-05-21 21:34:36, Feng Tang wrote:
> > > > On Thu, May 27, 2021 at 02:26:24PM +0200, Michal Hocko wrote:
> > > > > On Thu 27-05-21 20:10:41, Feng Tang wrote:
> > > > > > On Thu, May 27, 2021 at 10:20:08AM +0200, Michal Hocko wrote:
> > > > > > > On Wed 26-05-21 13:01:42, Feng Tang wrote:
> > > > > > > > Now the only remaining case of a real 'local' policy faked by
> > > > > > > > 'prefer' policy plus MPOL_F_LOCAL bit is:
> > > > > > > > 
> > > > > > > > A valid 'prefer' policy with a valid 'preferred' node is 'rebind'
> > > > > > > > to a nodemask which doesn't contain the 'preferred' node, then it
> > > > > > > > will handle allocation with 'local' policy.
> > > > > > > > 
> > > > > > > > Add a new 'MPOL_F_LOCAL_TEMP' bit for this case, and kill the
> > > > > > > > MPOL_F_LOCAL bit, which could simplify the code much.
> > > > > > > 
> > > > > > > As I've pointed out in the reply to the previous patch. It would have
> > > > > > > been much better if most of the MPOL_F_LOCAL usage was gone by this
> > > > > > > patch.
> > > > > > > 
> > > > > > > I also dislike a new MPOL_F_LOCAL_TEMP. This smells like sneaking the
> > > > > > > hack back in after you have painstakingly removed it. So this looks like
> > > > > > > a step backwards to me. I also do not understand why do we need the
> > > > > > > rebind callback for local policy at all. There is no node mask for local
> > > > > > > so what is going on here?
> > > > > > 
> > > > > > This is the special case 4 for the 'prefer' policy with the
> > > > > > MPOL_F_STATIC_NODES flag set: say it prefers node 1, and it is later
> > > > > > 'rebind' to a new nodemask of nodes 2-3. With the current code it will
> > > > > > get the MPOL_F_LOCAL bit added and actually perform the 'local' policy.
> > > > > > And if in the future it is 'rebind' again with a nodemask of nodes 1-2,
> > > > > > it will be restored back to the 'prefer' policy with preferred node 1.
> > > > > 
> > > > > Honestly I still do not follow the actual problem. 
> > > > 
> > > > I was confused too, and don't know the original thought behind it. This
> > > > case 4 was just imagined by reading the code.
> > > > 
> > > > > A preferred node is a
> > > > > _hint_. If you rebind the task to a different cpuset then why should we
> > > > > actually care? The allocator will fallback to the closest node according
> > > > > to the distance metric. Maybe the original code was trying to handle
> > > > > that in some way but I really do fail to understand that code and I
> > > > > strongly suspect it is more likely to be overengineered rather than backed
> > > > > by a real usecase. I might be wrong here but then this is an excellent
> > > > > opportunity to clarify all those subtleties.
> > > > 
> > > > From the code, the original special handling may be needed in 3 cases:
> > > >     get_policy_nodemask()
> > > >     policy_node()
> > > >     mempolicy_slab_node()
> > > > to not return the preset prefer_nid.
> > > 
> > > I am sorry but I do not follow. What is actually wrong if the preferred
> > > node is outside of the cpuset nodemask?
> > 
> > Sorry, I didn't make it clear. With the current code logic, it will
> > perform as the 'local' policy, but its mode is kept as 'prefer', so the
> > code still needs these tricky bit checks when these APIs are called for
> > this policy. I agree with you that this ping-pong rebind() may be
> > over-engineering, so for this case can we just change the policy from
> > 'prefer' to 'local' and drop the tricky bit manipulation? As 'prefer' is
> > just a hint, if the rebind misses the target node there is no need to
> > stick with the 'prefer' policy.
> 
> Again. I really do not understand why we should rebind or mark as local
> anything here. Is this a documented/expected behavior? What if somebody
> just changes the cpuset to include the preferred node again. Is it
> expected to have local preference now?

Good point! Marking it 'local' doesn't solve the whole issue. And I didn't
find any documentation defining the semantics.

> I can see you have posted a newer version which I haven't seen yet but
> this is really better to get resolved before building up more on top.
> And let me be explicit. I do believe that rebinding preferred policy is
> just bogus and it should be dropped altogether on the ground that a 
> preference is a mere hint from userspace where to start the allocation. 

Yes, the current mpol_rebind_preferred() logic is confusing. Let me
try to understand it correctly: are you suggesting doing nothing for
the 'prefer' policy's rebinding regarding MPOL_F_STATIC_NODES and
MPOL_F_RELATIVE_NODES, and just setting 'pol->w.cpuset_mems_allowed'
to the new nodemask?

Thanks,
Feng

> Unless I am missing something cpusets will be always authoritative for
> the final placement. The preferred node just acts as a starting point
> and it should be really preserved when cpusets changes. Otherwise we
> have a very subtle behavior corner cases.
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-31  7:32                 ` Feng Tang
@ 2021-05-31  8:22                   ` Michal Hocko
  2021-05-31  8:29                     ` Feng Tang
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2021-05-31  8:22 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Mon 31-05-21 15:32:52, Feng Tang wrote:
> On Mon, May 31, 2021 at 09:00:25AM +0200, Michal Hocko wrote:
[...]
> > I can see you have posted a newer version which I haven't seen yet but
> > this is really better to get resolved before building up more on top.
> > And let me be explicit. I do believe that rebinding preferred policy is
> > just bogus and it should be dropped altogether on the ground that a 
> > preference is a mere hint from userspace where to start the allocation. 
> 
> Yes, the current mpol_rebind_preferred() logic is confusing. Let me
> try to understand it correctly: are you suggesting doing nothing for
> the 'prefer' policy's rebinding regarding MPOL_F_STATIC_NODES and
> MPOL_F_RELATIVE_NODES, and just setting 'pol->w.cpuset_mems_allowed'
> to the new nodemask?

Yes, this is exactly what I had in mind.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit
  2021-05-31  8:22                   ` Michal Hocko
@ 2021-05-31  8:29                     ` Feng Tang
  0 siblings, 0 replies; 24+ messages in thread
From: Feng Tang @ 2021-05-31  8:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Mon, May 31, 2021 at 10:22:51AM +0200, Michal Hocko wrote:
> On Mon 31-05-21 15:32:52, Feng Tang wrote:
> > On Mon, May 31, 2021 at 09:00:25AM +0200, Michal Hocko wrote:
> [...]
> > > I can see you have posted a newer version which I haven't seen yet but
> > > this is really better to get resolved before building up more on top.
> > > And let me be explicit. I do believe that rebinding preferred policy is
> > > just bogus and it should be dropped altogether on the ground that a 
> > > preference is a mere hint from userspace where to start the allocation. 
> > 
> > Yes, the current mpol_rebind_preferred() logic is confusing. Let me
> > try to understand it correctly: are you suggesting doing nothing for
> > the 'prefer' policy's rebinding regarding MPOL_F_STATIC_NODES and
> > MPOL_F_RELATIVE_NODES, and just setting 'pol->w.cpuset_mems_allowed'
> > to the new nodemask?
> 
> yes this is exactly what I've had in mind.
 
Thanks for confirming. Will spin another version.

- Feng

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2021-05-31  8:29 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-26  5:01 [PATCH v1 0/4] mm/mempolicy: some fix and semantics cleanup Feng Tang
2021-05-26  5:01 ` [PATCH v1 1/4] mm/mempolicy: skip nodemask intersect check for 'interleave' when oom Feng Tang
2021-05-27  7:30   ` Michal Hocko
2021-05-27 13:05     ` Feng Tang
2021-05-27 13:15       ` Michal Hocko
2021-05-27 13:22         ` Feng Tang
2021-05-26  5:01 ` [PATCH v1 2/4] mm/mempolicy: unify the preprocessing for mbind and set_mempolicy Feng Tang
2021-05-27  7:39   ` Michal Hocko
2021-05-27 12:31     ` Feng Tang
2021-05-26  5:01 ` [PATCH v1 3/4] mm/mempolicy: don't handle MPOL_LOCAL like a fake MPOL_PREFERRED policy Feng Tang
2021-05-27  8:12   ` Michal Hocko
2021-05-27 12:06     ` Feng Tang
2021-05-27 12:16       ` Michal Hocko
2021-05-26  5:01 ` [PATCH v1 4/4] mm/mempolicy: kill MPOL_F_LOCAL bit Feng Tang
2021-05-27  8:20   ` Michal Hocko
2021-05-27 12:10     ` Feng Tang
2021-05-27 12:26       ` Michal Hocko
2021-05-27 13:34         ` Feng Tang
2021-05-27 15:34           ` Michal Hocko
2021-05-28  4:39             ` Feng Tang
2021-05-31  7:00               ` Michal Hocko
2021-05-31  7:32                 ` Feng Tang
2021-05-31  8:22                   ` Michal Hocko
2021-05-31  8:29                     ` Feng Tang
