linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/6] Introduce multi-preference mempolicy
@ 2021-07-12  8:09 Feng Tang
  2021-07-12  8:09 ` [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Feng Tang
                   ` (6 more replies)
  0 siblings, 7 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-12  8:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang, Feng Tang

This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
preference for nodes which will fulfil memory allocation requests. Unlike the
MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
invoke the OOM killer if those preferred nodes are not available.

Along with these patches are patches for libnuma, numactl, numademo, and memhog.
They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use cases when using tiered memory
usage models, which I've lovingly named:
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
requirements, allowing preference to be given to all nodes with "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
perhaps slow memory), but doesn't care which node it runs on. The application
can prefer a set of nodes and then bind the compute unit (CPU, accelerator,
etc.) to the node local to that memory. This reverses how nodes are chosen
today, where the kernel attempts to use memory local to the CPU whenever
possible; here, the attempt is to use the compute resource local to the memory.
2. The Tortoise - The administrator (or the application itself) is aware it only
needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes if
binding fails for all nodes in the nodemask.

Like MPOL_BIND, a nodemask is given. Inherently, this removes ordering from the
preference.

> /* Set the first two nodes as preferred in an 8-node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

> /* Mimic the interleave policy, but have a fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. An ordered list of nodes. Currently it's believed that the added complexity
   is not needed for the expected use cases.
2. A flag for bind to allow falling back to other nodes. This confuses the
   notion of binding and is less flexible than the current solution.
3. Create flags or new modes that help with some ordering. This offers both a
   friendlier API as well as a solution for more customized usage. It's unknown
   if it's worth the complexity to support this. Here is sample code for how
   this might work:

> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFERRED_MANY, 0x17c, 1024);
>
> // Default
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFERRED_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
>

In v1, Andi Kleen brought up reusing MPOL_PREFERRED as the mode for the API.
There wasn't consensus around this, so I've left the existing API as it was.
I'm open to more feedback here, but my slight preference is to use a new API,
as it ensures that people using it are entirely aware of what they're doing and
not accidentally misusing the old interface (in a similar way to how MPOL_LOCAL
was introduced).

In v1, Michal also brought up renaming this MPOL_PREFERRED_MASK. I'm equally
fine with that change, but I hadn't heard much emphatic support for one way or
another, so I've left that too.

- Ben/Dave/Feng

---
Changelog: 

  Since v5:
  * Rebased against 5.14-rc1. 

  Since v4:
  * Rebased on latest -mm tree (v5.13-rc), whose mempolicy code has
    been refactored much since v4 submission
  * add a dedicated alloc_page_preferred_many() (Michal Hocko)
  * refactor and add fix to hugetlb supporting code (Michal Hocko) 

  Since v3:
  * Rebased against v5.12-rc2
  * Drop the v3/0013 patch of creating NO_SLOWPATH gfp_mask bit
  * Skip direct reclaim for the first allocation try for
    MPOL_PREFERRED_MANY, which makes its semantics close to the
    existing MPOL_PREFERRED policy

  Since v2:
  * Rebased against v5.11
  * Fix a stack overflow related panic, and a kernel warning (Feng)
  * Some code cleanup (Feng)
  * One RFC patch to speed up memory allocation in some cases (Feng)

  Since v1:
  * Dropped patch to replace numa_node_id in some places (mhocko)
  * Dropped all the page allocation patches in favor of new mechanism to
    use fallbacks. (mhocko)
  * Dropped the special snowflake preferred node algorithm (bwidawsk)
  * If the preferred node fails, ALL nodes are rechecked instead of just
    the non-preferred nodes.


---

Ben Widawsky (3):
  mm/mempolicy: enable page allocation for MPOL_PREFERRED_MANY for
    general cases
  mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

Dave Hansen (1):
  mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes

Feng Tang (2):
  mm/mempolicy: add page allocation function for MPOL_PREFERRED_MANY
    policy
  mm/mempolicy: unify the create() func for bind/interleave/prefer-many
    policies

 .../admin-guide/mm/numa_memory_policy.rst          | 16 +++--
 include/uapi/linux/mempolicy.h                     |  1 +
 mm/hugetlb.c                                       | 25 ++++++++
 mm/mempolicy.c                                     | 75 +++++++++++++++++-----
 4 files changed, 96 insertions(+), 21 deletions(-)

-- 
2.7.4



* [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
@ 2021-07-12  8:09 ` Feng Tang
  2021-07-28 12:31   ` Michal Hocko
  2021-07-12  8:09 ` [PATCH v6 2/6] mm/mempolicy: add page allocation function for MPOL_PREFERRED_MANY policy Feng Tang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-12  8:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang, Dave Hansen, Feng Tang

From: Dave Hansen <dave.hansen@linux.intel.com>

The NUMA APIs currently allow passing in a "preferred node" as a
single bit set in a nodemask.  If more than one bit is set, bits
after the first are ignored.

This single node is generally OK for location-based NUMA where
memory being allocated will eventually be operated on by a single
CPU.  However, in systems with multiple memory types, folks want
to target a *type* of memory instead of a location.  For instance,
someone might want some high-bandwidth memory but not care about
the CPU next to which it is allocated.  Or, they want a cheap,
high-capacity allocation and want to target all NUMA nodes which
have persistent memory in volatile mode.  In both of these cases,
the application wants to target a *set* of nodes, but does not
want strict MPOL_BIND behavior, as that could lead to an OOM kill
or SIGSEGV.

So add MPOL_PREFERRED_MANY policy to support the multiple preferred
nodes requirement. This is not a pie-in-the-sky dream for an API.
This was a response to a specific ask of more than one group at Intel.
Specifically:

1. There are existing libraries that target memory types such as
   https://github.com/memkind/memkind.  These are known to suffer
   from SIGSEGV's when memory is low on targeted memory "kinds" that
   span more than one node.  The MCDRAM on a Xeon Phi in "Cluster on
   Die" mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
   which is targeted at either "cheap and slow" (PMEM) or "expensive and
   fast" (DRAM).  However, they do not want to experience allocation
   failures when the targeted type is unavailable.
3. Allocate-then-run.  Generally, we let the process scheduler decide
   on which physical CPU to run a task.  That location provides a
   default allocation policy, and memory availability is not generally
   considered when placing tasks.  For situations where memory is
   valuable and constrained, some users want to allocate memory first,
   *then* allocate close compute resources to the allocation.  This is
   the reverse of the normal (CPU) model.  Accelerators such as GPUs
   that operate on core-mm-managed memory are interested in this model.

A check is added in sanitize_mpol_flags() to not permit the 'prefer_many'
policy to be used for now; it will be removed in a later patch once all
implementations for 'prefer_many' are ready, as suggested by Michal Hocko.

Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/uapi/linux/mempolicy.h |  1 +
 mm/mempolicy.c                 | 44 +++++++++++++++++++++++++++++++++++++-----
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 19a00bc7fe86..046d0ccba4cd 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -22,6 +22,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_PREFERRED_MANY,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e32360e90274..17b5800b7dcc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,9 @@
  *                but useful to set in a VMA when you have a non default
  *                process policy.
  *
+ * preferred many Try a set of nodes first before normal fallback. This is
+ *                similar to preferred without the special case.
+ *
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
@@ -207,6 +210,14 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->nodes = *nodes;
+	return 0;
+}
+
 static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
@@ -408,6 +419,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	[MPOL_LOCAL] = {
 		.rebind = mpol_rebind_default,
 	},
+	[MPOL_PREFERRED_MANY] = {
+		.create = mpol_new_preferred_many,
+		.rebind = mpol_rebind_preferred,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -900,6 +915,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		*nodes = p->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -1446,7 +1462,13 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 {
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)(*mode) >= MPOL_MAX)
+
+	/*
+	 * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
+	 * is not fully implemented, don't permit it to be used for now,
+	 * and the logic will be restored in following patch
+	 */
+	if ((unsigned int)(*mode) >=  MPOL_PREFERRED_MANY)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
@@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 /* Return the node id preferred by the given mempolicy, or the given id */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
-	if (policy->mode == MPOL_PREFERRED) {
+	if (policy->mode == MPOL_PREFERRED ||
+	    policy->mode == MPOL_PREFERRED_MANY) {
 		nd = first_node(policy->nodes);
 	} else {
 		/*
@@ -1931,6 +1954,7 @@ unsigned int mempolicy_slab_node(void)
 
 	switch (policy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		return first_node(policy->nodes);
 
 	case MPOL_INTERLEAVE:
@@ -2063,6 +2087,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		*mask = mempolicy->nodes;
@@ -2173,10 +2198,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node and don't fall back to other nodes, as the cost of
 		 * remote accesses would likely offset THP benefits.
 		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
+		 * If the policy is interleave or multiple preferred nodes, or
+		 * does not allow the current node in its nodemask, we allocate
+		 * the standard way.
 		 */
-		if (pol->mode == MPOL_PREFERRED)
+		if ((pol->mode == MPOL_PREFERRED ||
+		     pol->mode == MPOL_PREFERRED_MANY))
 			hpage_node = first_node(pol->nodes);
 
 		nmask = policy_nodemask(gfp, pol);
@@ -2311,6 +2338,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2451,6 +2479,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
+		if (node_isset(curnid, pol->nodes))
+			goto out;
 		polnid = first_node(pol->nodes);
 		break;
 
@@ -2829,6 +2860,7 @@ static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
+	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
 
 
@@ -2907,6 +2939,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (!nodelist)
 			err = 0;
 		goto out;
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -2993,6 +3026,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		nodes = pol->nodes;
-- 
2.7.4



* [PATCH v6 2/6] mm/mempolicy: add page allocation function for MPOL_PREFERRED_MANY policy
  2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
  2021-07-12  8:09 ` [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Feng Tang
@ 2021-07-12  8:09 ` Feng Tang
  2021-07-28 12:42   ` Michal Hocko
  2021-07-12  8:09 ` [PATCH v6 3/6] mm/mempolicy: enable page allocation for MPOL_PREFERRED_MANY for general cases Feng Tang
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-12  8:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang, Feng Tang

The semantics of MPOL_PREFERRED_MANY are similar to MPOL_PREFERRED, in
that it will first try to allocate memory from the preferred node(s),
and fall back to all nodes in the system when the first try fails.

Add a dedicated allocation function for it, just like the 'interleave'
policy has.

Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 17b5800b7dcc..d17bf018efcc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2153,6 +2153,25 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 	return page;
 }
 
+static struct page *alloc_page_preferred_many(gfp_t gfp, unsigned int order,
+						struct mempolicy *pol)
+{
+	struct page *page;
+
+	/*
+	 * This is a two pass approach. The first pass will only try the
+	 * preferred nodes but skip the direct reclaim and allow the
+	 * allocation to fail, while the second pass will try all the
+	 * nodes in system.
+	 */
+	page = __alloc_pages(((gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM),
+				order, first_node(pol->nodes), &pol->nodes);
+	if (!page)
+		page = __alloc_pages(gfp, order, numa_node_id(), NULL);
+
+	return page;
+}
+
 /**
  * alloc_pages_vma - Allocate a page for a VMA.
  * @gfp: GFP flags.
-- 
2.7.4



* [PATCH v6 3/6] mm/mempolicy: enable page allocation for MPOL_PREFERRED_MANY for general cases
  2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
  2021-07-12  8:09 ` [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Feng Tang
  2021-07-12  8:09 ` [PATCH v6 2/6] mm/mempolicy: add page allocation function for MPOL_PREFERRED_MANY policy Feng Tang
@ 2021-07-12  8:09 ` Feng Tang
  2021-07-12  8:09 ` [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY Feng Tang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-12  8:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang, Feng Tang

From: Ben Widawsky <ben.widawsky@intel.com>

In order to support MPOL_PREFERRED_MANY, which is used by
set_mempolicy(2) and mbind(2), enable both alloc_pages() and
alloc_pages_vma() to use alloc_page_preferred_many().

Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d17bf018efcc..9dce67fc9bb6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2207,6 +2207,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		goto out;
 	}
 
+	if (pol->mode == MPOL_PREFERRED_MANY) {
+		page = alloc_page_preferred_many(gfp, order, pol);
+		mpol_cond_put(pol);
+		goto out;
+	}
+
 	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
 		int hpage_node = node;
 
@@ -2286,6 +2292,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
 	 */
 	if (pol->mode == MPOL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
+	else if (pol->mode == MPOL_PREFERRED_MANY)
+		page = alloc_page_preferred_many(gfp, order, pol);
 	else
 		page = __alloc_pages(gfp, order,
 				policy_node(gfp, pol, numa_node_id()),
-- 
2.7.4



* [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
                   ` (2 preceding siblings ...)
  2021-07-12  8:09 ` [PATCH v6 3/6] mm/mempolicy: enable page allocation for MPOL_PREFERRED_MANY for general cases Feng Tang
@ 2021-07-12  8:09 ` Feng Tang
  2021-07-21 20:49   ` Mike Kravetz
  2021-07-12  8:09 ` [PATCH v6 5/6] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY Feng Tang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-12  8:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang, Feng Tang

From: Ben Widawsky <ben.widawsky@intel.com>

Implement the missing huge page allocation functionality while obeying
the preferred node semantics. This is similar to the implementation
for general page allocation, as it uses a fallback mechanism to try
multiple preferred nodes first, and then all other nodes.

[Thanks to the 0day bot for catching the missing #ifdef CONFIG_NUMA issue]

Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Co-developed-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/hugetlb.c   | 25 +++++++++++++++++++++++++
 mm/mempolicy.c |  3 ++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 924553aa8f78..3e84508c1b8c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1164,7 +1164,18 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+#ifdef CONFIG_NUMA
+	if (mpol->mode == MPOL_PREFERRED_MANY) {
+		page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+		if (page)
+			goto check_reserve;
+		/* Fallback to all nodes */
+		nodemask = NULL;
+	}
+#endif
 	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
+
+check_reserve:
 	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
 		SetHPageRestoreReserve(page);
 		h->resv_huge_pages--;
@@ -2095,6 +2106,20 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 	nodemask_t *nodemask;
 
 	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
+#ifdef CONFIG_NUMA
+	if (mpol->mode == MPOL_PREFERRED_MANY) {
+		gfp_t gfp = (gfp_mask | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
+
+		page = alloc_surplus_huge_page(h, gfp, nid, nodemask);
+		if (page) {
+			mpol_cond_put(mpol);
+			return page;
+		}
+
+		/* Fallback to all nodes */
+		nodemask = NULL;
+	}
+#endif
 	page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
 	mpol_cond_put(mpol);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9dce67fc9bb6..93f8789758a7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2054,7 +2054,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
 		nid = policy_node(gfp_flags, *mpol, numa_node_id());
-		if ((*mpol)->mode == MPOL_BIND)
+		if ((*mpol)->mode == MPOL_BIND ||
+		    (*mpol)->mode == MPOL_PREFERRED_MANY)
 			*nodemask = &(*mpol)->nodes;
 	}
 	return nid;
-- 
2.7.4



* [PATCH v6 5/6] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
  2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
                   ` (3 preceding siblings ...)
  2021-07-12  8:09 ` [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY Feng Tang
@ 2021-07-12  8:09 ` Feng Tang
  2021-07-28 12:47   ` Michal Hocko
  2021-07-12  8:09 ` [PATCH v6 6/6] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies Feng Tang
  2021-07-15  0:15 ` [PATCH v6 0/6] Introduce multi-preference mempolicy Andrew Morton
  6 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-12  8:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang, Feng Tang

From: Ben Widawsky <ben.widawsky@intel.com>

Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.

MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch. Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode. Those shall contain the canonical reference.

NUMA systems continue to become more prevalent. New technologies like
PMEM make finer grain control over memory access patterns increasingly
desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of
nodes that will be tried first when performing allocations. If those
allocations fail, all remaining nodes will be tried. It's a
straightforward API which solves many of the presumptive needs of system
administrators wanting to optimize workloads on such machines. The mode
works either per VMA or per thread.

Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 Documentation/admin-guide/mm/numa_memory_policy.rst | 16 ++++++++++++----
 mm/mempolicy.c                                      |  7 +------
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a1499c..cd653561e531 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -245,6 +245,14 @@ MPOL_INTERLEAVED
 	address range or file.  During system boot up, the temporary
 	interleaved system default policy works in this mode.
 
+MPOL_PREFERRED_MANY
+        This mode specifies that the allocation should be attempted from the
+        nodemask specified in the policy. If that allocation fails, the kernel
+        will search other nodes, in order of increasing distance from the first
+        set bit in the nodemask based on information provided by the platform
+        firmware. It is similar to MPOL_PREFERRED with the main exception that
+        it is an error to have an empty nodemask.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
@@ -253,10 +261,10 @@ MPOL_F_STATIC_NODES
 	nodes changes after the memory policy has been defined.
 
 	Without this flag, any time a mempolicy is rebound because of a
-	change in the set of allowed nodes, the node (Preferred) or
-	nodemask (Bind, Interleave) is remapped to the new set of
-	allowed nodes.  This may result in nodes being used that were
-	previously undesired.
+        change in the set of allowed nodes, the preferred nodemask (Preferred
+        Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+        remapped to the new set of allowed nodes.  This may result in nodes
+        being used that were previously undesired.
 
 	With this flag, if the user-specified nodes overlap with the
 	nodes allowed by the task's cpuset, then the memory policy is
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 93f8789758a7..d90247d6a71b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1463,12 +1463,7 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
 
-	/*
-	 * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
-	 * is not fully implemented, don't permit it to be used for now,
-	 * and the logic will be restored in following patch
-	 */
-	if ((unsigned int)(*mode) >=  MPOL_PREFERRED_MANY)
+	if ((unsigned int)(*mode) >=  MPOL_MAX)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
-- 
2.7.4



* [PATCH v6 6/6] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
                   ` (4 preceding siblings ...)
  2021-07-12  8:09 ` [PATCH v6 5/6] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY Feng Tang
@ 2021-07-12  8:09 ` Feng Tang
  2021-07-28 12:51   ` Michal Hocko
  2021-07-15  0:15 ` [PATCH v6 0/6] Introduce multi-preference mempolicy Andrew Morton
  6 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-12  8:09 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang, Feng Tang

As they all do the same thing (sanity check and save the nodemask info),
create one mpol_new_nodemask() to reduce the redundancy.

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/mempolicy.c | 24 ++++--------------------
 1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d90247d6a71b..e5ce5a7e8d92 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -192,7 +192,7 @@ static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
 	nodes_onto(*ret, tmp, *rel);
 }
 
-static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
+static int mpol_new_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
 		return -EINVAL;
@@ -210,22 +210,6 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
-static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
-{
-	if (nodes_empty(*nodes))
-		return -EINVAL;
-	pol->nodes = *nodes;
-	return 0;
-}
-
-static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
-{
-	if (nodes_empty(*nodes))
-		return -EINVAL;
-	pol->nodes = *nodes;
-	return 0;
-}
-
 /*
  * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
  * any, for the new policy.  mpol_new() has already validated the nodes
@@ -405,7 +389,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.rebind = mpol_rebind_default,
 	},
 	[MPOL_INTERLEAVE] = {
-		.create = mpol_new_interleave,
+		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_nodemask,
 	},
 	[MPOL_PREFERRED] = {
@@ -413,14 +397,14 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.rebind = mpol_rebind_preferred,
 	},
 	[MPOL_BIND] = {
-		.create = mpol_new_bind,
+		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_nodemask,
 	},
 	[MPOL_LOCAL] = {
 		.rebind = mpol_rebind_default,
 	},
 	[MPOL_PREFERRED_MANY] = {
-		.create = mpol_new_preferred_many,
+		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
 };
-- 
2.7.4



* Re: [PATCH v6 0/6] Introduce multi-preference mempolicy
  2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
                   ` (5 preceding siblings ...)
  2021-07-12  8:09 ` [PATCH v6 6/6] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies Feng Tang
@ 2021-07-15  0:15 ` Andrew Morton
  2021-07-15  2:13   ` Feng Tang
  2021-07-15 18:49   ` Dave Hansen
  6 siblings, 2 replies; 38+ messages in thread
From: Andrew Morton @ 2021-07-15  0:15 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Michal Hocko, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Mon, 12 Jul 2021 16:09:28 +0800 Feng Tang <feng.tang@intel.com> wrote:

> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
> interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
> preference for nodes which will fulfil memory allocation requests. Unlike the
> MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
> works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
> invoke the OOM killer if those preferred nodes are not available.

Do we have any real-world testing which demonstrates the benefits of
all of this?


* Re: [PATCH v6 0/6] Introduce multi-preference mempolicy
  2021-07-15  0:15 ` [PATCH v6 0/6] Introduce multi-preference mempolicy Andrew Morton
@ 2021-07-15  2:13   ` Feng Tang
  2021-07-15 18:49   ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-15  2:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Michal Hocko, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

Hi Andrew,

Thanks for reviewing!

On Wed, Jul 14, 2021 at 05:15:40PM -0700, Andrew Morton wrote:
> On Mon, 12 Jul 2021 16:09:28 +0800 Feng Tang <feng.tang@intel.com> wrote:
> 
> > This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> > This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
> > interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
> > preference for nodes which will fulfil memory allocation requests. Unlike the
> > MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
> > works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
> > invoke the OOM killer if those preferred nodes are not available.
> 
> Do we have any real-world testing which demonstrates the benefits of
> all of this?

We have done some internal tests, and are actively working with an external
customer on using this new 'prefer-many' policy. They have different types
of memory (fast DRAM and slower persistent memory) in their systems, and
their program wants to set a clear preference for several NUMA nodes, to
better place its huge application data before running the application.

We also met another issue, where a customer wanted to run a docker container
bound to 2 persistent memory nodes, which always failed. At that
time we tried 2 hack patches to solve it:
https://lore.kernel.org/lkml/1604470210-124827-2-git-send-email-feng.tang@intel.com/
https://lore.kernel.org/lkml/1604470210-124827-3-git-send-email-feng.tang@intel.com/
And that use case can be easily achieved with this new policy.

Thanks,
Feng


* Re: [PATCH v6 0/6] Introduce multi-preference mempolicy
  2021-07-15  0:15 ` [PATCH v6 0/6] Introduce multi-preference mempolicy Andrew Morton
  2021-07-15  2:13   ` Feng Tang
@ 2021-07-15 18:49   ` Dave Hansen
  1 sibling, 0 replies; 38+ messages in thread
From: Dave Hansen @ 2021-07-15 18:49 UTC (permalink / raw)
  To: Andrew Morton, Feng Tang
  Cc: linux-mm, Michal Hocko, David Rientjes, Ben Widawsky,
	linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On 7/14/21 5:15 PM, Andrew Morton wrote:
> On Mon, 12 Jul 2021 16:09:28 +0800 Feng Tang <feng.tang@intel.com> wrote:
>> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
>> This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
>> interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
>> preference for nodes which will fulfil memory allocation requests. Unlike the
>> MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it
>> works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or
>> invoke the OOM killer if those preferred nodes are not available.
> Do we have any real-world testing which demonstrates the benefits of
> all of this?

Yes, it's actually been quite useful in practice already.

If we take persistent memory media (PMEM) and hot-add/online it with the
DAX kmem driver, we get NUMA nodes with lots of capacity (~6TB is
typical) but weird performance; PMEM has good read speed, but low write
speed.

That low write speed is *so* low that it dominates the performance more
than the distance from the CPUs.  Folks who want PMEM really don't care
about locality.  The discussions with the testers usually go something
like this:

Tester: How do I make my test use PMEM on nodes 2 and 3?
Kernel Guys: use 'numactl --membind=2-3'
Tester: I tried that, but I'm getting allocation failures once I fill up
        PMEM.  Shouldn't it fall back to DRAM?
Kernel Guys: Fine, use 'numactl --preferred=2-3'
Tester: That worked, but it started using DRAM after it exhausted node 2
Kernel Guys:  Dang it.  I forgot --preferred ignores everything after
              the first node.  Fine, we'll patch the kernel.

This has happened more than once.  End users want to be able to specify
a specific physical media, but don't want to have to deal with the sharp
edges of strict binding.

This has happened both with slow media like PMEM and "faster" media like
High-Bandwidth Memory.


* Re: [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  2021-07-12  8:09 ` [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY Feng Tang
@ 2021-07-21 20:49   ` Mike Kravetz
  2021-07-22  8:11     ` Feng Tang
  2021-07-22  9:42     ` Michal Hocko
  0 siblings, 2 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-07-21 20:49 UTC (permalink / raw)
  To: Feng Tang, linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky
  Cc: linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Dan Williams,
	ying.huang

On 7/12/21 1:09 AM, Feng Tang wrote:
> From: Ben Widawsky <ben.widawsky@intel.com>
> 
> Implement the missing huge page allocation functionality while obeying
> the preferred node semantics. This is similar to the implementation
> for general page allocation, as it uses a fallback mechanism to try
> multiple preferred nodes first, and then all other nodes.
> 
> [Thanks to 0day bot for catching the missing #ifdef CONFIG_NUMA issue]
> 
> Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> Co-developed-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  mm/hugetlb.c   | 25 +++++++++++++++++++++++++
>  mm/mempolicy.c |  3 ++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 924553aa8f78..3e84508c1b8c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1164,7 +1164,18 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>  
>  	gfp_mask = htlb_alloc_mask(h);
>  	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> +#ifdef CONFIG_NUMA
> +	if (mpol->mode == MPOL_PREFERRED_MANY) {
> +		page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> +		if (page)
> +			goto check_reserve;
> +		/* Fallback to all nodes */
> +		nodemask = NULL;
> +	}
> +#endif
>  	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> +
> +check_reserve:
>  	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
>  		SetHPageRestoreReserve(page);
>  		h->resv_huge_pages--;
> @@ -2095,6 +2106,20 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
>  	nodemask_t *nodemask;
>  
>  	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
> +#ifdef CONFIG_NUMA
> +	if (mpol->mode == MPOL_PREFERRED_MANY) {
> +		gfp_t gfp = (gfp_mask | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;

I believe __GFP_NOWARN will be added later in alloc_buddy_huge_page, so
no need to add here?

> +
> +		page = alloc_surplus_huge_page(h, gfp, nid, nodemask);
> +		if (page) {
> +			mpol_cond_put(mpol);
> +			return page;
> +		}
> +
> +		/* Fallback to all nodes */
> +		nodemask = NULL;
> +	}
> +#endif
>  	page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
>  	mpol_cond_put(mpol);
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 9dce67fc9bb6..93f8789758a7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2054,7 +2054,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
>  					huge_page_shift(hstate_vma(vma)));
>  	} else {
>  		nid = policy_node(gfp_flags, *mpol, numa_node_id());
> -		if ((*mpol)->mode == MPOL_BIND)
> +		if ((*mpol)->mode == MPOL_BIND ||
> +		    (*mpol)->mode == MPOL_PREFERRED_MANY)
>  			*nodemask = &(*mpol)->nodes;
>  	}
>  	return nid;
> 

Other than the one nit above,

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz


* Re: [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  2021-07-21 20:49   ` Mike Kravetz
@ 2021-07-22  8:11     ` Feng Tang
  2021-07-22  9:42     ` Michal Hocko
  1 sibling, 0 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-22  8:11 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, Andrew Morton, Michal Hocko, David Rientjes,
	Dave Hansen, Ben Widawsky, linux-kernel, linux-api,
	Andrea Arcangeli, Mel Gorman, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

Mike,

On Wed, Jul 21, 2021 at 01:49:15PM -0700, Mike Kravetz wrote:
> On 7/12/21 1:09 AM, Feng Tang wrote:
> > From: Ben Widawsky <ben.widawsky@intel.com>
> > 
> > Implement the missing huge page allocation functionality while obeying
> > the preferred node semantics. This is similar to the implementation
> > for general page allocation, as it uses a fallback mechanism to try
> > multiple preferred nodes first, and then all other nodes.
> > 
> > [Thanks to 0day bot for catching the missing #ifdef CONFIG_NUMA issue]
> > 
> > Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > Co-developed-by: Feng Tang <feng.tang@intel.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  mm/hugetlb.c   | 25 +++++++++++++++++++++++++
> >  mm/mempolicy.c |  3 ++-
> >  2 files changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 924553aa8f78..3e84508c1b8c 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1164,7 +1164,18 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
> >  
> >  	gfp_mask = htlb_alloc_mask(h);
> >  	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> > +#ifdef CONFIG_NUMA
> > +	if (mpol->mode == MPOL_PREFERRED_MANY) {
> > +		page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> > +		if (page)
> > +			goto check_reserve;
> > +		/* Fallback to all nodes */
> > +		nodemask = NULL;
> > +	}
> > +#endif
> >  	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
> > +
> > +check_reserve:
> >  	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
> >  		SetHPageRestoreReserve(page);
> >  		h->resv_huge_pages--;
> > @@ -2095,6 +2106,20 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
> >  	nodemask_t *nodemask;
> >  
> >  	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
> > +#ifdef CONFIG_NUMA
> > +	if (mpol->mode == MPOL_PREFERRED_MANY) {
> > +		gfp_t gfp = (gfp_mask | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
> 
> I believe __GFP_NOWARN will be added later in alloc_buddy_huge_page, so
> no need to add here?

Thanks for the suggestion, will remove it. 

> > +
> > +		page = alloc_surplus_huge_page(h, gfp, nid, nodemask);
> > +		if (page) {
> > +			mpol_cond_put(mpol);
> > +			return page;
> > +		}
> > +
> > +		/* Fallback to all nodes */
> > +		nodemask = NULL;
> > +	}
> > +#endif
> >  	page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
> >  	mpol_cond_put(mpol);
> >  
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 9dce67fc9bb6..93f8789758a7 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2054,7 +2054,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
> >  					huge_page_shift(hstate_vma(vma)));
> >  	} else {
> >  		nid = policy_node(gfp_flags, *mpol, numa_node_id());
> > -		if ((*mpol)->mode == MPOL_BIND)
> > +		if ((*mpol)->mode == MPOL_BIND ||
> > +		    (*mpol)->mode == MPOL_PREFERRED_MANY)
> >  			*nodemask = &(*mpol)->nodes;
> >  	}
> >  	return nid;
> > 
> 
> Other than the one nit above,
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

Thanks!



Andrew,

I have to ask for your help again to fold this to the 4/6 patch, thanks!

- Feng

---------------------------8<--------------------------------------------

From de1cd29d8da96856a6d754a30a4c7585d87b8348 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Thu, 22 Jul 2021 16:00:49 +0800
Subject: [PATCH] mm/hugetlb: remove the unneeded __GFP_NOWARN flag setting

As alloc_buddy_huge_page() will set it anyway.

Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/hugetlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 528947d..a96e283 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2162,9 +2162,9 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
 #ifdef CONFIG_NUMA
 	if (mpol->mode == MPOL_PREFERRED_MANY) {
-		gfp_t gfp = (gfp_mask | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
-
-		page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false);
+		page = alloc_surplus_huge_page(h,
+					gfp_mask & ~__GFP_DIRECT_RECLAIM,
+					nid, nodemask, false);
 		if (page) {
 			mpol_cond_put(mpol);
 			return page;
-- 
2.7.4



* Re: [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  2021-07-21 20:49   ` Mike Kravetz
  2021-07-22  8:11     ` Feng Tang
@ 2021-07-22  9:42     ` Michal Hocko
  2021-07-22 16:21       ` Mike Kravetz
  1 sibling, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-22  9:42 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Feng Tang, linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On Wed 21-07-21 13:49:15, Mike Kravetz wrote:
> On 7/12/21 1:09 AM, Feng Tang wrote:
[...]
> > +#ifdef CONFIG_NUMA
> > +	if (mpol->mode == MPOL_PREFERRED_MANY) {
> > +		gfp_t gfp = (gfp_mask | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
> 
> I believe __GFP_NOWARN will be added later in alloc_buddy_huge_page, so
> no need to add here?

The mask is manipulated here anyway and the __GFP_NOWARN is really
telling that there is no need to report the failure for _this_
allocation request. alloc_surplus_huge_page might alter that in whatever
way in the future. So I would keep NOWARN here for the code clarity
rather than rely on some implicit assumption down the path.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
  2021-07-22  9:42     ` Michal Hocko
@ 2021-07-22 16:21       ` Mike Kravetz
  0 siblings, 0 replies; 38+ messages in thread
From: Mike Kravetz @ 2021-07-22 16:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Feng Tang, linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Dan Williams, ying.huang

On 7/22/21 2:42 AM, Michal Hocko wrote:
> On Wed 21-07-21 13:49:15, Mike Kravetz wrote:
>> On 7/12/21 1:09 AM, Feng Tang wrote:
> [...]
>>> +#ifdef CONFIG_NUMA
>>> +	if (mpol->mode == MPOL_PREFERRED_MANY) {
>>> +		gfp_t gfp = (gfp_mask | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
>>
>> I believe __GFP_NOWARN will be added later in alloc_buddy_huge_page, so
>> no need to add here?
> 
> The mask is manipulated here anyway and the __GFP_NOWARN is really
> telling that there is no need to report the failure for _this_
> allocation request. alloc_surplus_huge_page might alter that in whatever
> way in the future. So I would keep NOWARN here for the code clarity
> rather than rely on some implicit assumption down the path.

Makes sense.  Better to leave the __GFP_NOWARN here for clarity.

-- 
Mike Kravetz


* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-12  8:09 ` [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Feng Tang
@ 2021-07-28 12:31   ` Michal Hocko
  2021-07-28 14:11     ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-28 12:31 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

[Sorry for a late review]

On Mon 12-07-21 16:09:29, Feng Tang wrote:
[...]
> @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  /* Return the node id preferred by the given mempolicy, or the given id */
>  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
>  {
> -	if (policy->mode == MPOL_PREFERRED) {
> +	if (policy->mode == MPOL_PREFERRED ||
> +	    policy->mode == MPOL_PREFERRED_MANY) {
>  		nd = first_node(policy->nodes);
>  	} else {
>  		/*

Do we really want to have the preferred node to be always the first node
in the node mask? Shouldn't that strive for a locality as well? Existing
callers already prefer numa_node_id() - aka local node - and I believe we
shouldn't just throw that away here.

> @@ -1931,6 +1954,7 @@ unsigned int mempolicy_slab_node(void)
>  
>  	switch (policy->mode) {
>  	case MPOL_PREFERRED:
> +	case MPOL_PREFERRED_MANY:
>  		return first_node(policy->nodes);

Similarly here but I am not really familiar with the slab numa code
enough to have strong opinions here.

> @@ -2173,10 +2198,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		 * node and don't fall back to other nodes, as the cost of
>  		 * remote accesses would likely offset THP benefits.
>  		 *
> -		 * If the policy is interleave, or does not allow the current
> -		 * node in its nodemask, we allocate the standard way.
> +		 * If the policy is interleave or multiple preferred nodes, or
> +		 * does not allow the current node in its nodemask, we allocate
> +		 * the standard way.
>  		 */
> -		if (pol->mode == MPOL_PREFERRED)
> +		if ((pol->mode == MPOL_PREFERRED ||
> +		     pol->mode == MPOL_PREFERRED_MANY))
>  			hpage_node = first_node(pol->nodes);

Same here.

> @@ -2451,6 +2479,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
>  		break;
>  
>  	case MPOL_PREFERRED:
> +	case MPOL_PREFERRED_MANY:
> +		if (node_isset(curnid, pol->nodes))
> +			goto out;
>  		polnid = first_node(pol->nodes);
>  		break;

I do not follow what is the point of using first_node here. Either the
node is in the mask or it is misplaced. What are you trying to achieve
here?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 2/6] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy
  2021-07-12  8:09 ` [PATCH v6 2/6] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy Feng Tang
@ 2021-07-28 12:42   ` Michal Hocko
  2021-07-28 15:18     ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-28 12:42 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Mon 12-07-21 16:09:30, Feng Tang wrote:
> The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED,
> that it will first try to allocate memory from the preferred node(s),
> and fallback to all nodes in system when first try fails.
> 
> Add a dedicated function for it just like 'interleave' policy.
> 
> Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>

It would be better to squash this together with the actual user of the
function added by the next patch.

> ---
>  mm/mempolicy.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 17b5800b7dcc..d17bf018efcc 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2153,6 +2153,25 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
>  	return page;
>  }
>  
> +static struct page *alloc_page_preferred_many(gfp_t gfp, unsigned int order,
> +						struct mempolicy *pol)

We likely want a node parameter to know which one we want to start with
for locality. Callers should use policy_node for that.

> +{
> +	struct page *page;
> +
> +	/*
> +	 * This is a two pass approach. The first pass will only try the
> +	 * preferred nodes but skip the direct reclaim and allow the
> +	 * allocation to fail, while the second pass will try all the
> +	 * nodes in system.
> +	 */
> +	page = __alloc_pages(((gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM),
> +				order, first_node(pol->nodes), &pol->nodes);

Although most users will likely have some form of GFP_*USER* here, and
clearing __GFP_DIRECT_RECLAIM will put all other reclaim modifiers out
of the game, I think it would be better to explicitly disable some of
them to prevent surprises. E.g. any potential __GFP_NOFAIL would be more
than surprising here. We do not have any (hopefully) but this should be
pretty cheap to exclude as we already have to modify the mask anyway.

	preferred_gfp = gfp | __GFP_NOWARN;
	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL)


> +	if (!page)
> +		page = __alloc_pages(gfp, order, numa_node_id(), NULL);
> +
> +	return page;
> +}
> +
>  /**
>   * alloc_pages_vma - Allocate a page for a VMA.
>   * @gfp: GFP flags.
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 5/6] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
  2021-07-12  8:09 ` [PATCH v6 5/6] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY Feng Tang
@ 2021-07-28 12:47   ` Michal Hocko
  2021-07-28 13:41     ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-28 12:47 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Mon 12-07-21 16:09:33, Feng Tang wrote:
> From: Ben Widawsky <ben.widawsky@intel.com>
> 
> Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.
> 
> MPOL_PREFERRED_MANY will be adequately documented in the internal
> admin-guide with this patch. Eventually, the man pages for mbind(2),
> get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
> about this mode. Those shall contain the canonical reference.
> 
> NUMA systems continue to become more prevalent. New technologies like
> PMEM make finer grain control over memory access patterns increasingly
> desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of
> nodes that will be tried first when performing allocations. If those
> allocations fail, all remaining nodes will be tried. It's a
> straightforward API which solves many of the presumptive needs of system
> administrators wanting to optimize workloads on such machines. The mode
> will work either per VMA, or per thread.
> 
> Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  Documentation/admin-guide/mm/numa_memory_policy.rst | 16 ++++++++++++----
>  mm/mempolicy.c                                      |  7 +------
>  2 files changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 067a90a1499c..cd653561e531 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -245,6 +245,14 @@ MPOL_INTERLEAVED
>  	address range or file.  During system boot up, the temporary
>  	interleaved system default policy works in this mode.
>  
> +MPOL_PREFERRED_MANY
> +        This mode specifies that the allocation should be attempted from the
> +        nodemask specified in the policy. If that allocation fails, the kernel
> +        will search other nodes, in order of increasing distance from the first
> +        set bit in the nodemask based on information provided by the platform
> +        firmware. It is similar to MPOL_PREFERRED with the main exception that
> > +        it is an error to have an empty nodemask.

I believe the target audience of this documents are users rather than
kernel developers and for those the wording might be rather cryptic. I
would rephrase like this
	This mode specifies that the allocation should be preferably
	satisfied from the nodemask specified in the policy. If there is
	memory pressure on all nodes in the nodemask the allocation
	can fall back to all existing numa nodes. This is effectively
	MPOL_PREFERRED allowed for a mask rather than a single node.

With that or similar feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 6/6] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  2021-07-12  8:09 ` [PATCH v6 6/6] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies Feng Tang
@ 2021-07-28 12:51   ` Michal Hocko
  2021-07-28 13:50     ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-28 12:51 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Mon 12-07-21 16:09:34, Feng Tang wrote:
> As they all do the same thing (sanity check and save the nodemask info),
> create one mpol_new_nodemask() to reduce redundancy.

Do we really need a create() callback these days?

> Signed-off-by: Feng Tang <feng.tang@intel.com>

Other than that LGTM
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/mempolicy.c | 24 ++++--------------------
>  1 file changed, 4 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d90247d6a71b..e5ce5a7e8d92 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -192,7 +192,7 @@ static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
>  	nodes_onto(*ret, tmp, *rel);
>  }
>  
> -static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
> +static int mpol_new_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>  {
>  	if (nodes_empty(*nodes))
>  		return -EINVAL;
> @@ -210,22 +210,6 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
>  	return 0;
>  }
>  
> -static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
> -{
> -	if (nodes_empty(*nodes))
> -		return -EINVAL;
> -	pol->nodes = *nodes;
> -	return 0;
> -}
> -
> -static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> -{
> -	if (nodes_empty(*nodes))
> -		return -EINVAL;
> -	pol->nodes = *nodes;
> -	return 0;
> -}
> -
>  /*
>   * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
>   * any, for the new policy.  mpol_new() has already validated the nodes
> @@ -405,7 +389,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.rebind = mpol_rebind_default,
>  	},
>  	[MPOL_INTERLEAVE] = {
> -		.create = mpol_new_interleave,
> +		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_nodemask,
>  	},
>  	[MPOL_PREFERRED] = {
> @@ -413,14 +397,14 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.rebind = mpol_rebind_preferred,
>  	},
>  	[MPOL_BIND] = {
> -		.create = mpol_new_bind,
> +		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_nodemask,
>  	},
>  	[MPOL_LOCAL] = {
>  		.rebind = mpol_rebind_default,
>  	},
>  	[MPOL_PREFERRED_MANY] = {
> -		.create = mpol_new_preferred_many,
> +		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_preferred,
>  	},
>  };
> -- 
> 2.7.4

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 5/6] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
  2021-07-28 12:47   ` Michal Hocko
@ 2021-07-28 13:41     ` Feng Tang
  0 siblings, 0 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-28 13:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Wed, Jul 28, 2021 at 02:47:23PM +0200, Michal Hocko wrote:
> On Mon 12-07-21 16:09:33, Feng Tang wrote:
> > From: Ben Widawsky <ben.widawsky@intel.com>
> > 
> > Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.
> > 
> > MPOL_PREFERRED_MANY will be adequately documented in the internal
> > admin-guide with this patch. Eventually, the man pages for mbind(2),
> > get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
> > about this mode. Those shall contain the canonical reference.
> > 
> > NUMA systems continue to become more prevalent. New technologies like
> > PMEM make finer grain control over memory access patterns increasingly
> > desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of
> > nodes that will be tried first when performing allocations. If those
> > allocations fail, all remaining nodes will be tried. It's a
> > straightforward API which solves many of the presumptive needs of system
> > administrators wanting to optimize workloads on such machines. The mode
> > will work either per VMA, or per thread.
> > 
> > Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  Documentation/admin-guide/mm/numa_memory_policy.rst | 16 ++++++++++++----
> >  mm/mempolicy.c                                      |  7 +------
> >  2 files changed, 13 insertions(+), 10 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> > index 067a90a1499c..cd653561e531 100644
> > --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> > +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> > @@ -245,6 +245,14 @@ MPOL_INTERLEAVED
> >  	address range or file.  During system boot up, the temporary
> >  	interleaved system default policy works in this mode.
> >  
> > +MPOL_PREFERRED_MANY
> > +        This mode specifies that the allocation should be attempted from the
> > +        nodemask specified in the policy. If that allocation fails, the kernel
> > +        will search other nodes, in order of increasing distance from the first
> > +        set bit in the nodemask based on information provided by the platform
> > +        firmware. It is similar to MPOL_PREFERRED with the main exception that
> > +        is an error to have an empty nodemask.
> 
> I believe the target audience of this documents are users rather than
> kernel developers and for those the wording might be rather cryptic. I
> would rephrase like this
> 	This mode specifies that the allocation should preferably be
> 	satisfied from the nodemask specified in the policy. If there is
> 	memory pressure on all nodes in the nodemask, the allocation
> 	can fall back to all existing NUMA nodes. This is effectively
> 	MPOL_PREFERRED allowed for a mask rather than a single node.
> 
> With that or similar feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

Will revise the text as suggested.

- Feng

> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v6 6/6] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
  2021-07-28 12:51   ` Michal Hocko
@ 2021-07-28 13:50     ` Feng Tang
  0 siblings, 0 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-28 13:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Wed, Jul 28, 2021 at 02:51:42PM +0200, Michal Hocko wrote:
> On Mon 12-07-21 16:09:34, Feng Tang wrote:
> > As they all do the same thing: sanity check and save nodemask info, create
> > one mpol_new_nodemask() to reduce redundancy.
> 
> Do we really need a create() callback these days?

I think it tries to provide a per-policy sanity check (though
it's the same for all existing policies), and a per-policy
nodemask setting (the current 'prefer' policy differs from the
others).

> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> Other than that LGTM
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

- Feng

> > ---
> >  mm/mempolicy.c | 24 ++++--------------------
> >  1 file changed, 4 insertions(+), 20 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index d90247d6a71b..e5ce5a7e8d92 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -192,7 +192,7 @@ static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
> >  	nodes_onto(*ret, tmp, *rel);
> >  }
> >  
> > -static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
> > +static int mpol_new_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> >  {
> >  	if (nodes_empty(*nodes))
> >  		return -EINVAL;
> > @@ -210,22 +210,6 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
> >  	return 0;
> >  }
> >  
> > -static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
> > -{
> > -	if (nodes_empty(*nodes))
> > -		return -EINVAL;
> > -	pol->nodes = *nodes;
> > -	return 0;
> > -}
> > -
> > -static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
> > -{
> > -	if (nodes_empty(*nodes))
> > -		return -EINVAL;
> > -	pol->nodes = *nodes;
> > -	return 0;
> > -}
> > -
> >  /*
> >   * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
> >   * any, for the new policy.  mpol_new() has already validated the nodes
> > @@ -405,7 +389,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> >  		.rebind = mpol_rebind_default,
> >  	},
> >  	[MPOL_INTERLEAVE] = {
> > -		.create = mpol_new_interleave,
> > +		.create = mpol_new_nodemask,
> >  		.rebind = mpol_rebind_nodemask,
> >  	},
> >  	[MPOL_PREFERRED] = {
> > @@ -413,14 +397,14 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
> >  		.rebind = mpol_rebind_preferred,
> >  	},
> >  	[MPOL_BIND] = {
> > -		.create = mpol_new_bind,
> > +		.create = mpol_new_nodemask,
> >  		.rebind = mpol_rebind_nodemask,
> >  	},
> >  	[MPOL_LOCAL] = {
> >  		.rebind = mpol_rebind_default,
> >  	},
> >  	[MPOL_PREFERRED_MANY] = {
> > -		.create = mpol_new_preferred_many,
> > +		.create = mpol_new_nodemask,
> >  		.rebind = mpol_rebind_preferred,
> >  	},
> >  };
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-28 12:31   ` Michal Hocko
@ 2021-07-28 14:11     ` Feng Tang
  2021-07-28 16:12       ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-28 14:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> [Sorry for a late review]

Not at all. Thank you for all your reviews and suggestions from v1
to v6!

> On Mon 12-07-21 16:09:29, Feng Tang wrote:
> [...]
> > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> >  /* Return the node id preferred by the given mempolicy, or the given id */
> >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> >  {
> > -	if (policy->mode == MPOL_PREFERRED) {
> > +	if (policy->mode == MPOL_PREFERRED ||
> > +	    policy->mode == MPOL_PREFERRED_MANY) {
> >  		nd = first_node(policy->nodes);
> >  	} else {
> >  		/*
> 
> Do we really want to have the preferred node to be always the first node
> in the node mask? Shouldn't that strive for a locality as well? Existing
> callers already prefer numa_node_id() - aka local node - and I belive we
> shouldn't just throw that away here.
 
I think it's about the difference between the 'local' and
'prefer/prefer-many' policies. There are different kinds of memory HW:
HBM (High Bandwidth Memory), normal DRAM and PMEM (Persistent Memory),
which differ in price, bandwidth, speed etc. A platform may have two,
or all three, of these types, and there are real use cases which want
memory to come from the 'preferred' node/nodes rather than from the
local node.

And good point about the 'local node': if the 'prefer-many' policy's
nodemask has the local node set, we should pick it rather than this
'first_node', and the same semantic also applies to the several other
places you pointed out. Or do I misunderstand your point?

Thanks,
Feng

> > @@ -1931,6 +1954,7 @@ unsigned int mempolicy_slab_node(void)
> >  
> >  	switch (policy->mode) {
> >  	case MPOL_PREFERRED:
> > +	case MPOL_PREFERRED_MANY:
> >  		return first_node(policy->nodes);
> 
> Similarly here but I am not really familiar with the slab numa code
> enough to have strong opinions here.
> 
> > @@ -2173,10 +2198,12 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		 * node and don't fall back to other nodes, as the cost of
> >  		 * remote accesses would likely offset THP benefits.
> >  		 *
> > -		 * If the policy is interleave, or does not allow the current
> > -		 * node in its nodemask, we allocate the standard way.
> > +		 * If the policy is interleave or multiple preferred nodes, or
> > +		 * does not allow the current node in its nodemask, we allocate
> > +		 * the standard way.
> >  		 */
> > -		if (pol->mode == MPOL_PREFERRED)
> > +		if ((pol->mode == MPOL_PREFERRED ||
> > +		     pol->mode == MPOL_PREFERRED_MANY))
> >  			hpage_node = first_node(pol->nodes);
> 
> Same here.
> 
> > @@ -2451,6 +2479,9 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
> >  		break;
> >  
> >  	case MPOL_PREFERRED:
> > +	case MPOL_PREFERRED_MANY:
> > +		if (node_isset(curnid, pol->nodes))
> > +			goto out;
> >  		polnid = first_node(pol->nodes);
> >  		break;
> 
> I do not follow what is the point of using first_node here. Either the
> node is in the mask or it is misplaced. What are you trying to achieve
> here?
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v6 2/6] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy
  2021-07-28 12:42   ` Michal Hocko
@ 2021-07-28 15:18     ` Feng Tang
  2021-07-28 15:25       ` Feng Tang
  2021-07-28 16:14       ` Michal Hocko
  0 siblings, 2 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-28 15:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Wed, Jul 28, 2021 at 02:42:26PM +0200, Michal Hocko wrote:
> On Mon 12-07-21 16:09:30, Feng Tang wrote:
> > The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED,
> > that it will first try to allocate memory from the preferred node(s),
> > and fallback to all nodes in system when first try fails.
> > 
> > Add a dedicated function for it just like 'interleave' policy.
> > 
> > Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
> > Suggested-by: Michal Hocko <mhocko@suse.com>
> > Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> It would be better to squash this together with the actual user of the
> function added by the next patch.
 
Ok, will do

> > ---
> >  mm/mempolicy.c | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 17b5800b7dcc..d17bf018efcc 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2153,6 +2153,25 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> >  	return page;
> >  }
> >  
> > +static struct page *alloc_page_preferred_many(gfp_t gfp, unsigned int order,
> > +						struct mempolicy *pol)
> 
> We likely want a node parameter to know which one we want to start with
> for locality. Callers should use policy_node for that.
 
Yes, locality should be considered, something like this?

	int pnid, lnid = numa_node_id();

	if (node_isset(lnid, pol->nodes))
		pnid = lnid;
	else
		pnid = first_node(pol->nodes);

	page = __alloc_pages(((gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM),
				order, pnid, &pol->nodes);
	if (!page)
		page = __alloc_pages(gfp, order, lnid, NULL);
	return page;


> > +{
> > +	struct page *page;
> > +
> > +	/*
> > +	 * This is a two pass approach. The first pass will only try the
> > +	 * preferred nodes but skip the direct reclaim and allow the
> > +	 * allocation to fail, while the second pass will try all the
> > +	 * nodes in system.
> > +	 */
> > +	page = __alloc_pages(((gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM),
> > +				order, first_node(pol->nodes), &pol->nodes);
> 
> Although most users will likely have some form of GFP_*USER* here and
> clearing __GFP_DIRECT_RECLAIM will put all other reclaim modifiers out
> of game I think it would be better to explicitly disable some of them to
> prevent from surprises. E.g. any potential __GFP_NOFAIL would be more
> than surprising here. We do not have any (hopefully) but this should be
> pretty cheap to exclude as we already have to modify already.
> 
> 	preferred_gfp = gfp | __GFP_NOWARN;
> 	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL)

OK, will add.

Thanks,
Feng

> 
> > +	if (!page)
> > +		page = __alloc_pages(gfp, order, numa_node_id(), NULL);
> > +
> > +	return page;
> > +}
> > +
> >  /**
> >   * alloc_pages_vma - Allocate a page for a VMA.
> >   * @gfp: GFP flags.
> > -- 
> > 2.7.4
> 
> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v6 2/6] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy
  2021-07-28 15:18     ` Feng Tang
@ 2021-07-28 15:25       ` Feng Tang
  2021-07-28 16:15         ` Michal Hocko
  2021-07-28 16:14       ` Michal Hocko
  1 sibling, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-28 15:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Hansen, Dave, Widawsky,
	Ben, linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Williams, Dan J, Huang, Ying

On Wed, Jul 28, 2021 at 11:18:10PM +0800, Tang, Feng wrote:
> On Wed, Jul 28, 2021 at 02:42:26PM +0200, Michal Hocko wrote:
> > On Mon 12-07-21 16:09:30, Feng Tang wrote:
> > > The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED,
> > > that it will first try to allocate memory from the preferred node(s),
> > > and fallback to all nodes in system when first try fails.
> > > 
> > > Add a dedicated function for it just like 'interleave' policy.
> > > 
> > > Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
> > > Suggested-by: Michal Hocko <mhocko@suse.com>
> > > Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > 
> > It would be better to squash this together with the actual user of the
> > function added by the next patch.
>  
> Ok, will do
> 
> > > ---
> > >  mm/mempolicy.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > > 
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 17b5800b7dcc..d17bf018efcc 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -2153,6 +2153,25 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> > >  	return page;
> > >  }
> > >  
> > > +static struct page *alloc_page_preferred_many(gfp_t gfp, unsigned int order,
> > > +						struct mempolicy *pol)
> > 
> > We likely want a node parameter to know which one we want to start with
> > for locality. Callers should use policy_node for that.
>  
> Yes, locality should be considered, something like this?
> 
> 	int pnid, lnid = numa_node_id();
> 
> 	if (is_nodeset(lnid, &pol->nodes))
> 		pnid = local_nid;
> 	else
> 		pnid = first_node(pol->nodes);

One further thought: if the local node is not in the nodemask,
should we compare the distance of all the nodes in the nodemask
to the local node and choose the closest one?

Thanks,
Feng

> 	page = __alloc_pages(((gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM),
> 				order, pnid, &pol->nodes);
> 	if (!page)
> 		page = __alloc_pages(gfp, order, lnid, NULL);
> 	return page;
> 


* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-28 14:11     ` Feng Tang
@ 2021-07-28 16:12       ` Michal Hocko
  2021-07-29  7:09         ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-28 16:12 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Wed 28-07-21 22:11:56, Feng Tang wrote:
> On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > [Sorry for a late review]
> 
> Not at all. Thank you for all your reviews and suggestions from v1
> to v6!
> 
> > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > [...]
> > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > >  {
> > > -	if (policy->mode == MPOL_PREFERRED) {
> > > +	if (policy->mode == MPOL_PREFERRED ||
> > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > >  		nd = first_node(policy->nodes);
> > >  	} else {
> > >  		/*
> > 
> > Do we really want to have the preferred node to be always the first node
> > in the node mask? Shouldn't that strive for a locality as well? Existing
> > callers already prefer numa_node_id() - aka local node - and I belive we
> > shouldn't just throw that away here.
>  
> I think it's about the difference of 'local' and 'prefer/perfer-many'
> policy. There are different kinds of memory HW: HBM(High Bandwidth
> Memory), normal DRAM, PMEM (Persistent Memory), which have different
> price, bandwidth, speed etc. A platform may have two, or all three of
> these types, and there are real use case which want memory comes
> 'preferred' node/nodes than the local node.
> 
> And good point for 'local node', if the 'prefer-many' policy's
> nodemask has local node set, we should pick it han this
> 'first_node', and the same semantic also applies to the other
> several places you pointed out. Or do I misunderstand you point?

Yeah. Essentially what I am trying to say is that for
MPOL_PREFERRED_MANY you simply want to return the given node without any
alteration. That node will be used for the fallback zonelist, and the
nodemask will make sure we won't get out of the policy.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 2/6] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy
  2021-07-28 15:18     ` Feng Tang
  2021-07-28 15:25       ` Feng Tang
@ 2021-07-28 16:14       ` Michal Hocko
  1 sibling, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2021-07-28 16:14 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang

On Wed 28-07-21 23:18:10, Feng Tang wrote:
> On Wed, Jul 28, 2021 at 02:42:26PM +0200, Michal Hocko wrote:
> > On Mon 12-07-21 16:09:30, Feng Tang wrote:
> > > The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED,
> > > that it will first try to allocate memory from the preferred node(s),
> > > and fallback to all nodes in system when first try fails.
> > > 
> > > Add a dedicated function for it just like 'interleave' policy.
> > > 
> > > Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
> > > Suggested-by: Michal Hocko <mhocko@suse.com>
> > > Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
> > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > 
> > It would be better to squash this together with the actual user of the
> > function added by the next patch.
>  
> Ok, will do
> 
> > > ---
> > >  mm/mempolicy.c | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > > 
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 17b5800b7dcc..d17bf018efcc 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -2153,6 +2153,25 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> > >  	return page;
> > >  }
> > >  
> > > +static struct page *alloc_page_preferred_many(gfp_t gfp, unsigned int order,
> > > +						struct mempolicy *pol)
> > 
> > We likely want a node parameter to know which one we want to start with
> > for locality. Callers should use policy_node for that.
>  
> Yes, locality should be considered, something like this?
> 
> 	int pnid, lnid = numa_node_id();
> 
> 	if (is_nodeset(lnid, &pol->nodes))
> 		pnid = local_nid;
> 	else
> 		pnid = first_node(pol->nodes);
> 
> 	page = __alloc_pages(((gfp | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM),
> 				order, pnid, &pol->nodes);
> 	if (!page)
> 		page = __alloc_pages(gfp, order, lnid, NULL);
> 	return page;

No, I really meant to take a node argument and use it as is. Your
callers already have some node preference, usually the local node, and
as we have a nodemask here we do not really need any special logic, as
mentioned in the other email. The preferred node will act only as a
source for the zonelist.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 2/6] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy
  2021-07-28 15:25       ` Feng Tang
@ 2021-07-28 16:15         ` Michal Hocko
  0 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2021-07-28 16:15 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Hansen, Dave, Widawsky,
	Ben, linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Williams, Dan J, Huang, Ying

On Wed 28-07-21 23:25:07, Feng Tang wrote:
> On Wed, Jul 28, 2021 at 11:18:10PM +0800, Tang, Feng wrote:
> > On Wed, Jul 28, 2021 at 02:42:26PM +0200, Michal Hocko wrote:
> > > On Mon 12-07-21 16:09:30, Feng Tang wrote:
> > > > The semantics of MPOL_PREFERRED_MANY is similar to MPOL_PREFERRED,
> > > > that it will first try to allocate memory from the preferred node(s),
> > > > and fallback to all nodes in system when first try fails.
> > > > 
> > > > Add a dedicated function for it just like 'interleave' policy.
> > > > 
> > > > Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
> > > > Suggested-by: Michal Hocko <mhocko@suse.com>
> > > > Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> > > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > > 
> > > It would be better to squash this together with the actual user of the
> > > function added by the next patch.
> >  
> > Ok, will do
> > 
> > > > ---
> > > >  mm/mempolicy.c | 19 +++++++++++++++++++
> > > >  1 file changed, 19 insertions(+)
> > > > 
> > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > > index 17b5800b7dcc..d17bf018efcc 100644
> > > > --- a/mm/mempolicy.c
> > > > +++ b/mm/mempolicy.c
> > > > @@ -2153,6 +2153,25 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
> > > >  	return page;
> > > >  }
> > > >  
> > > > +static struct page *alloc_page_preferred_many(gfp_t gfp, unsigned int order,
> > > > +						struct mempolicy *pol)
> > > 
> > > We likely want a node parameter to know which one we want to start with
> > > for locality. Callers should use policy_node for that.
> >  
> > Yes, locality should be considered, something like this?
> > 
> > 	int pnid, lnid = numa_node_id();
> > 
> > 	if (is_nodeset(lnid, &pol->nodes))
> > 		pnid = local_nid;
> > 	else
> > 		pnid = first_node(pol->nodes);
> 
> One further thought is, if local node is not in the nodemask,
> should we compare the distance of all the nodes in nodemask
> to the local node and chose the shortest? 

Nope, that is what the zonelist is for. The nodemask will do the rest.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-28 16:12       ` Michal Hocko
@ 2021-07-29  7:09         ` Feng Tang
  2021-07-29 13:38           ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-29  7:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Wed, Jul 28, 2021 at 06:12:21PM +0200, Michal Hocko wrote:
> On Wed 28-07-21 22:11:56, Feng Tang wrote:
> > On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > > [Sorry for a late review]
> > 
> > Not at all. Thank you for all your reviews and suggestions from v1
> > to v6!
> > 
> > > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > > [...]
> > > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > > >  {
> > > > -	if (policy->mode == MPOL_PREFERRED) {
> > > > +	if (policy->mode == MPOL_PREFERRED ||
> > > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > > >  		nd = first_node(policy->nodes);
> > > >  	} else {
> > > >  		/*
> > > 
> > > Do we really want to have the preferred node to be always the first node
> > > in the node mask? Shouldn't that strive for a locality as well? Existing
> > > callers already prefer numa_node_id() - aka local node - and I belive we
> > > shouldn't just throw that away here.
> >  
> > I think it's about the difference of 'local' and 'prefer/perfer-many'
> > policy. There are different kinds of memory HW: HBM(High Bandwidth
> > Memory), normal DRAM, PMEM (Persistent Memory), which have different
> > price, bandwidth, speed etc. A platform may have two, or all three of
> > these types, and there are real use case which want memory comes
> > 'preferred' node/nodes than the local node.
> > 
> > And good point for 'local node', if the 'prefer-many' policy's
> > nodemask has local node set, we should pick it han this
> > 'first_node', and the same semantic also applies to the other
> > several places you pointed out. Or do I misunderstand you point?
> 
> Yeah. Essentially what I am trying to tell is that for
> MPOL_PREFERRED_MANY you simply want to return the given node without any
> alternation. That node will be used for the fallback zonelist and the
> nodemask would make sure we won't get out of the policy.

I think I got your point now :)

With current mainline code, the 'prefer' policy will return the preferred
node.

For 'prefer-many', we would like to keep a similar semantic: the
preference order of nodes is 'preferred' > 'local' > all other nodes.
There is a customer use case whose platform has both DRAM and cheaper,
bigger and slower PMEM. They analyzed the hotness of their huge data
set, want to put the huge cold data into the PMEM, and only fall back
to DRAM as the last step. The HW topology could be simplified like this:

Socket 0:  Node 0 (CPU + 64GB DRAM), Node 2 (512GB PMEM)
Socket 1:  Node 1 (CPU + 64GB DRAM), Node 3 (512GB PMEM)

E.g. they want to allocate memory for cold application data with the
'prefer-many' policy + a 0xC nodemask (the N2+N3 PMEM nodes), so no
matter whether the application is running on Node 0 or Node 1, the
'local' node only has DRAM, which is not their preference, and they
want a preferred-->local-->others order.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs


* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-29  7:09         ` Feng Tang
@ 2021-07-29 13:38           ` Michal Hocko
  2021-07-29 15:12             ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-29 13:38 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Thu 29-07-21 15:09:18, Feng Tang wrote:
> On Wed, Jul 28, 2021 at 06:12:21PM +0200, Michal Hocko wrote:
> > On Wed 28-07-21 22:11:56, Feng Tang wrote:
> > > On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > > > [Sorry for a late review]
> > > 
> > > Not at all. Thank you for all your reviews and suggestions from v1
> > > to v6!
> > > 
> > > > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > > > [...]
> > > > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > > > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > > > >  {
> > > > > -	if (policy->mode == MPOL_PREFERRED) {
> > > > > +	if (policy->mode == MPOL_PREFERRED ||
> > > > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > > > >  		nd = first_node(policy->nodes);
> > > > >  	} else {
> > > > >  		/*
> > > > 
> > > > Do we really want to have the preferred node to be always the first node
> > > > in the node mask? Shouldn't that strive for a locality as well? Existing
> > > > callers already prefer numa_node_id() - aka local node - and I belive we
> > > > shouldn't just throw that away here.
> > >  
> > > I think it's about the difference of 'local' and 'prefer/perfer-many'
> > > policy. There are different kinds of memory HW: HBM(High Bandwidth
> > > Memory), normal DRAM, PMEM (Persistent Memory), which have different
> > > price, bandwidth, speed etc. A platform may have two, or all three of
> > > these types, and there are real use case which want memory comes
> > > 'preferred' node/nodes than the local node.
> > > 
> > > And good point for 'local node', if the 'prefer-many' policy's
> > > nodemask has local node set, we should pick it han this
> > > 'first_node', and the same semantic also applies to the other
> > > several places you pointed out. Or do I misunderstand you point?
> > 
> > Yeah. Essentially what I am trying to tell is that for
> > MPOL_PREFERRED_MANY you simply want to return the given node without any
> > alternation. That node will be used for the fallback zonelist and the
> > nodemask would make sure we won't get out of the policy.
> 
> I think I got your point now :)
> 
> With current mainline code, the 'prefer' policy will return the preferred
> node.

Yes this makes sense as there is only one node.

> For 'prefer-many', we would like to keep the similar semantic, that the
> preference of node is 'preferred' > 'local' > all other nodes.

Yes, but which of the preferred nodes do you want to start with? Say
your nodemask prefers nodes 0 and 2, with the following distance topology
	0	1	2	3
0	10	30	20	30
1	30	10	20	30
2	20	30	10	30
3	30	30	30	10

And say you are running on cpu 1. I believe you want your allocation
preferably from node 2 rather than 0, right? With your approach you
would start with node 0, which is more distant from cpu 1. Also, the
semantic of giving nodes an ordering based on their numbers sounds
rather weird to me.

The semantic I am proposing is to allocate from the preferred nodes in
distance order, starting from the local node.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-29 13:38           ` Michal Hocko
@ 2021-07-29 15:12             ` Feng Tang
  2021-07-29 16:21               ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-29 15:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> On Thu 29-07-21 15:09:18, Feng Tang wrote:
> > On Wed, Jul 28, 2021 at 06:12:21PM +0200, Michal Hocko wrote:
> > > On Wed 28-07-21 22:11:56, Feng Tang wrote:
> > > > On Wed, Jul 28, 2021 at 02:31:03PM +0200, Michal Hocko wrote:
> > > > > [Sorry for a late review]
> > > > 
> > > > Not at all. Thank you for all your reviews and suggestions from v1
> > > > to v6!
> > > > 
> > > > > On Mon 12-07-21 16:09:29, Feng Tang wrote:
> > > > > [...]
> > > > > > @@ -1887,7 +1909,8 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > > >  /* Return the node id preferred by the given mempolicy, or the given id */
> > > > > >  static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
> > > > > >  {
> > > > > > -	if (policy->mode == MPOL_PREFERRED) {
> > > > > > +	if (policy->mode == MPOL_PREFERRED ||
> > > > > > +	    policy->mode == MPOL_PREFERRED_MANY) {
> > > > > >  		nd = first_node(policy->nodes);
> > > > > >  	} else {
> > > > > >  		/*
> > > > > 
> > > > > Do we really want to have the preferred node to be always the first node
> > > > > in the node mask? Shouldn't that strive for a locality as well? Existing
> > > > > callers already prefer numa_node_id() - aka local node - and I believe we
> > > > > shouldn't just throw that away here.
> > > >  
> > > > I think it's about the difference of the 'local' and 'prefer/prefer-many'
> > > > policies. There are different kinds of memory HW: HBM (High Bandwidth
> > > > Memory), normal DRAM, PMEM (Persistent Memory), which have different
> > > > price, bandwidth, speed etc. A platform may have two, or all three of
> > > > these types, and there are real use cases which want memory to come
> > > > from the 'preferred' node/nodes rather than the local node.
> > > > 
> > > > And good point for the 'local node': if the 'prefer-many' policy's
> > > > nodemask has the local node set, we should pick it rather than this
> > > > 'first_node', and the same semantic also applies to the other
> > > > several places you pointed out. Or do I misunderstand your point?
> > > 
> > > Yeah. Essentially what I am trying to tell is that for
> > > MPOL_PREFERRED_MANY you simply want to return the given node without any
> > > alteration. That node will be used for the fallback zonelist and the
> > > nodemask would make sure we won't get out of the policy.
> > 
> > I think I got your point now :)
> > 
> > With current mainline code, the 'prefer' policy will return the preferred
> > node.
> 
> Yes this makes sense as there is only one node.
> 
> > For 'prefer-many', we would like to keep a similar semantic: the
> > preference order is 'preferred' > 'local' > all other nodes.
> 
> Yes, but which of the preferred nodes do you want to start with? Say
> your nodemask prefers nodes 0 and 2 with the following topology
> 	0	1	2	3
> 0	10	30	20	30
> 1	30	10	20	30
> 2	20	30	10	30
> 3	30	30	30	10
> 
> And say you are running on cpu 1. I believe you want your allocation
> preferably from node 2 rather than 0, right?

Yes, and in one earlier reply, I had a similar thought
https://lore.kernel.org/lkml/20210728152507.GE43486@shbuild999.sh.intel.com/

  "
  One further thought is, if local node is not in the nodemask,
  should we compare the distance of all the nodes in nodemask
  to the local node and choose the shortest?
  "
And we may add a new API, if there is no existing one:
	int closest_node(int nid, nodemask_t *nmask);
to pick the best node from the 'prefer-many' nodemask.
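For illustration, here is a minimal userspace sketch of what such a closest_node() helper could do. A plain bitmask stands in for the kernel's nodemask_t, and the 4-node distance table quoted above is hard-coded; both are assumptions made for the example, not kernel API:

```c
#include <assert.h>
#include <limits.h>

#define MAX_NODES 4

/* The 4-node distance table quoted earlier in the thread
 * (10 = local, larger values = more distant). */
static const int node_distance_tbl[MAX_NODES][MAX_NODES] = {
	{ 10, 30, 20, 30 },
	{ 30, 10, 20, 30 },
	{ 20, 30, 10, 30 },
	{ 30, 30, 30, 10 },
};

/*
 * Sketch of the proposed helper: return the node in 'nmask' with the
 * shortest distance to 'nid'; 'nid' itself wins when it is set.
 * 'nmask' is a plain bitmask standing in for nodemask_t.
 */
static int closest_node(int nid, unsigned int nmask)
{
	int best = -1, best_dist = INT_MAX;
	int n;

	if (nmask & (1u << nid))
		return nid;	/* the local node is always preferred */

	for (n = 0; n < MAX_NODES; n++) {
		if (!(nmask & (1u << n)))
			continue;
		if (node_distance_tbl[nid][n] < best_dist) {
			best_dist = node_distance_tbl[nid][n];
			best = n;
		}
	}
	return best;	/* -1 if the mask was empty */
}
```

With a mask preferring nodes 0 and 2 (0x5) and the task on node 1, this picks node 2, matching the semantic discussed above.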

> With your approach you
> would start with node 0 which would be more distant from cpu 1. 

> Also the
> semantic to give nodes some ordering based on their numbers sounds
> rather weird to me.

I agree, and as I admitted in the first reply, this needs to be fixed.

> The semantic I am proposing is to allocate from preferred nodes in
> distance order starting from the local node.

So the plan is:
* if the local node is set in 'prefer-many's nodemask, then choose it
* otherwise choose the node with the shortest distance to the local node
?

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-29 15:12             ` Feng Tang
@ 2021-07-29 16:21               ` Michal Hocko
  2021-07-30  3:05                 ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-29 16:21 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Thu 29-07-21 23:12:42, Feng Tang wrote:
> On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
[...]
> > Also the
> > semantic to give nodes some ordering based on their numbers sounds
> > rather weird to me.
> 
> I agree, and as I admitted in the first reply, this needs to be fixed.

OK. I was not really clear that we are on the same page here.

> > The semantic I am proposing is to allocate from preferred nodes in
> > distance order starting from the local node.
> 
> So the plan is:
> * if the local node is set in 'prefer-many's nodemask, then choose it
> * otherwise choose the node with the shortest distance to the local node
> ?

Yes and what I am trying to say is that you will achieve that simply by
doing the following in policy_node:
	if (policy->mode == MPOL_PREFERRED_MANY)
		return nd;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-29 16:21               ` Michal Hocko
@ 2021-07-30  3:05                 ` Feng Tang
  2021-07-30  6:36                   ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-07-30  3:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Thu, Jul 29, 2021 at 06:21:19PM +0200, Michal Hocko wrote:
> On Thu 29-07-21 23:12:42, Feng Tang wrote:
> > On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> [...]
> > > Also the
> > > semantic to give nodes some ordering based on their numbers sounds
> > > rather weird to me.
> > 
> > I agree, and as I admitted in the first reply, this needs to be fixed.
> 
> OK. I was not really clear that we are on the same page here.
> 
> > > The semantic I am proposing is to allocate from preferred nodes in
> > > distance order starting from the local node.
> > 
> > So the plan is:
> > * if the local node is set in 'prefer-many's nodemask, then choose it
> > * otherwise choose the node with the shortest distance to the local node
> > ?
> 
> Yes and what I am trying to say is that you will achieve that simply by
> doing the following in policy_node:
> 	if (policy->mode == MPOL_PREFERRED_MANY)
> 		return nd;

One thing is, it's possible that 'nd' is not set in the preferred
nodemask. 

For policy_node(), most of its callers use the local node id as the
'nd' parameter. HBM and PMEM memory nodes are cpuless nodes, so they
will never be a 'local node', but some use cases prefer only these
nodes.

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-30  3:05                 ` Feng Tang
@ 2021-07-30  6:36                   ` Michal Hocko
  2021-07-30  7:18                     ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-07-30  6:36 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Fri 30-07-21 11:05:02, Feng Tang wrote:
> On Thu, Jul 29, 2021 at 06:21:19PM +0200, Michal Hocko wrote:
> > On Thu 29-07-21 23:12:42, Feng Tang wrote:
> > > On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> > [...]
> > > > Also the
> > > > semantic to give nodes some ordering based on their numbers sounds
> > > > rather weird to me.
> > > 
> > > I agree, and as I admitted in the first reply, this needs to be fixed.
> > 
> > OK. I was not really clear that we are on the same page here.
> > 
> > > > The semantic I am proposing is to allocate from preferred nodes in
> > > > distance order starting from the local node.
> > > 
> > > So the plan is:
> > > * if the local node is set in 'prefer-many's nodemask, then choose it
> > > * otherwise choose the node with the shortest distance to the local node
> > > ?
> > 
> > Yes and what I am trying to say is that you will achieve that simply by
> > doing the following in policy_node:
> > 	if (policy->mode == MPOL_PREFERRED_MANY)
> > 		return nd;
> 
> One thing is, it's possible that 'nd' is not set in the preferred
> nodemask. 

Yes, and there shouldn't be any problem with that.  The given node is
only used to get the respective zonelist (a distance-ordered list of
zones to try). get_page_from_freelist will then use the preferred node
mask to filter this zone list. Is that more clear now?
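As a toy userspace model of this two-step split (the node id only selects a distance-ordered candidate list; the nodemask then filters it), one could write the following. The per-node fallback orders are hand-derived from the topology quoted earlier in the thread, and the plain bitmask stands in for nodemask_t, so everything here is an illustrative assumption rather than kernel code:

```c
#include <assert.h>

#define MAX_NODES 4

/*
 * Hand-built distance-ordered candidate lists, one per "local" node,
 * mimicking what the zonelist gives the allocator (self first, then
 * increasing distance, ties broken by node number).
 */
static const int fallback_order[MAX_NODES][MAX_NODES] = {
	{ 0, 2, 1, 3 },
	{ 1, 2, 0, 3 },
	{ 2, 0, 1, 3 },
	{ 3, 0, 1, 2 },
};

/*
 * Model of the allocator step: walk the candidate list selected by
 * 'nd' and return the first node permitted by 'nmask', the way
 * get_page_from_freelist() filters the zonelist; -1 if none allowed.
 */
static int first_allowed_node(int nd, unsigned int nmask)
{
	int i;

	for (i = 0; i < MAX_NODES; i++) {
		int n = fallback_order[nd][i];

		if (nmask & (1u << n))
			return n;
	}
	return -1;
}
```

In this model, passing the local node as 'nd' even when it is not in the preferred mask is harmless: with the task on node 1 and a mask preferring nodes 0 and 2 (0x5), the walk skips node 1 and lands on node 2, the closest preferred node.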
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-30  6:36                   ` Michal Hocko
@ 2021-07-30  7:18                     ` Feng Tang
  2021-07-30  7:38                       ` Michal Hocko
  2021-08-02  8:11                       ` Feng Tang
  0 siblings, 2 replies; 38+ messages in thread
From: Feng Tang @ 2021-07-30  7:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Fri, Jul 30, 2021 at 08:36:50AM +0200, Michal Hocko wrote:
> On Fri 30-07-21 11:05:02, Feng Tang wrote:
> > On Thu, Jul 29, 2021 at 06:21:19PM +0200, Michal Hocko wrote:
> > > On Thu 29-07-21 23:12:42, Feng Tang wrote:
> > > > On Thu, Jul 29, 2021 at 03:38:44PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > Also the
> > > > > semantic to give nodes some ordering based on their numbers sounds
> > > > > rather weird to me.
> > > > 
> > > > I agree, and as I admitted in the first reply, this needs to be fixed.
> > > 
> > > OK. I was not really clear that we are on the same page here.
> > > 
> > > > > The semantic I am proposing is to allocate from preferred nodes in
> > > > > distance order starting from the local node.
> > > > 
> > > > So the plan is:
> > > > * if the local node is set in 'prefer-many's nodemask, then choose it
> > > > * otherwise choose the node with the shortest distance to the local node
> > > > ?
> > > 
> > > Yes and what I am trying to say is that you will achieve that simply by
> > > doing the following in policy_node:
> > > 	if (policy->mode == MPOL_PREFERRED_MANY)
> > > 		return nd;
> > 
> > One thing is, it's possible that 'nd' is not set in the preferred
> > nodemask. 
> 
> Yes, and there shouldn't be any problem with that.  The given node is
> only used to get the respective zonelist (a distance-ordered list of
> zones to try). get_page_from_freelist will then use the preferred node
> mask to filter this zone list. Is that more clear now?

Yes, from the code, the policy_node() is always coupled with
policy_nodemask(), which secures the 'nodemask' limit. Thanks for
the clarification!

And mempolicy_slab_node() seems to be a little different; we may need
to reuse its logic for the 'bind' policy, which is similar to what
we've discussed: pick the nearest node to the local node. And similarly
for mpol_misplaced(). Thoughts?

Thanks,
Feng

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-30  7:18                     ` Feng Tang
@ 2021-07-30  7:38                       ` Michal Hocko
  2021-08-02  8:11                       ` Feng Tang
  1 sibling, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2021-07-30  7:38 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Dave Hansen,
	Ben Widawsky, linux-kernel, linux-api, Andrea Arcangeli,
	Mel Gorman, Mike Kravetz, Randy Dunlap, Vlastimil Babka,
	Andi Kleen, Dan Williams, ying.huang, Dave Hansen

On Fri 30-07-21 15:18:40, Feng Tang wrote:
[...]
> And mempolicy_slab_node() seems to be a little different; we may need
> to reuse its logic for the 'bind' policy, which is similar to what
> we've discussed: pick the nearest node to the local node. And similarly
> for mpol_misplaced(). Thoughts?

I have to admit, I haven't looked closer to what slab does. I am not
familiar with internals and would have to study it some more. Maybe slab
maintainers have an idea.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-07-30  7:18                     ` Feng Tang
  2021-07-30  7:38                       ` Michal Hocko
@ 2021-08-02  8:11                       ` Feng Tang
  2021-08-02 11:14                         ` Michal Hocko
  1 sibling, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-08-02  8:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Hansen, Dave, Widawsky,
	Ben, linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Williams, Dan J, Huang, Ying, Dave Hansen

On Fri, Jul 30, 2021 at 03:18:40PM +0800, Tang, Feng wrote:
[snip]
> > > One thing is, it's possible that 'nd' is not set in the preferred
> > > nodemask. 
> > 
> > Yes, and there shouldn't be any problem with that.  The given node is
> > only used to get the respective zonelist (a distance-ordered list of
> > zones to try). get_page_from_freelist will then use the preferred node
> > mask to filter this zone list. Is that more clear now?
> 
> Yes, from the code, the policy_node() is always coupled with
> policy_nodemask(), which secures the 'nodemask' limit. Thanks for
> the clarification!

Hi Michal,

To ensure the nodemask limit, policy_nodemask() also needs some
change to return the nodemask for the 'prefer-many' policy, so here is
an updated 1/6 patch, which mainly changes the node/nodemask selection
for the 'prefer-many' policy. Could you review it? Thanks!

- Feng

------8<-------------------------------------------------------
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 19a00bc7fe86..046d0ccba4cd 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -22,6 +22,7 @@ enum {
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
+	MPOL_PREFERRED_MANY,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e675bfb856da..ea97530f86db 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,9 @@
  *                but useful to set in a VMA when you have a non default
  *                process policy.
  *
+ * preferred many Try a set of nodes first before normal fallback. This is
+ *                similar to preferred without the special case.
+ *
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
@@ -207,6 +210,14 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
+static int mpol_new_preferred_many(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	if (nodes_empty(*nodes))
+		return -EINVAL;
+	pol->nodes = *nodes;
+	return 0;
+}
+
 static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
@@ -408,6 +419,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	[MPOL_LOCAL] = {
 		.rebind = mpol_rebind_default,
 	},
+	[MPOL_PREFERRED_MANY] = {
+		.create = mpol_new_preferred_many,
+		.rebind = mpol_rebind_preferred,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -900,6 +915,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		*nodes = p->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -1446,7 +1462,13 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
 {
 	*flags = *mode & MPOL_MODE_FLAGS;
 	*mode &= ~MPOL_MODE_FLAGS;
-	if ((unsigned int)(*mode) >= MPOL_MAX)
+
+	/*
+	 * The check should be 'mode >= MPOL_MAX', but as 'prefer_many'
+	 * is not fully implemented, don't permit it to be used for now,
+	 * and the logic will be restored in a following patch
+	 */
+	if ((unsigned int)(*mode) >= MPOL_PREFERRED_MANY)
 		return -EINVAL;
 	if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
 		return -EINVAL;
@@ -1875,8 +1897,13 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
  */
 nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 {
-	/* Lower zones don't get a nodemask applied for MPOL_BIND */
-	if (unlikely(policy->mode == MPOL_BIND) &&
+	int mode = policy->mode;
+
+	/*
+	 * Lower zones don't get a nodemask applied for 'bind' and
+	 * 'prefer-many' policies
+	 */
+	if (unlikely(mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) &&
 			apply_policy_zone(policy, gfp_zone(gfp)) &&
 			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
 		return &policy->nodes;
@@ -1884,7 +1911,13 @@ nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	return NULL;
 }
 
-/* Return the node id preferred by the given mempolicy, or the given id */
+/*
+ * Return the preferred node id for 'prefer' mempolicy, and return
+ * the given id for all other policies.
+ *
+ * policy_node() is always coupled with policy_nodemask(), which
+ * secures the nodemask limit for 'bind' and 'prefer-many' policy.
+ */
 static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
 	if (policy->mode == MPOL_PREFERRED) {
@@ -1936,7 +1969,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
-	case MPOL_BIND: {
+	case MPOL_BIND:
+	case MPOL_PREFERRED_MANY:
+	{
 		struct zoneref *z;
 
 		/*
@@ -2008,7 +2043,7 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
  * @addr: address in @vma for shared policy lookup and interleave policy
  * @gfp_flags: for requested zone
  * @mpol: pointer to mempolicy pointer for reference counted mempolicy
- * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
+ * @nodemask: pointer to nodemask pointer for 'bind' and 'prefer-many' policy
  *
  * Returns a nid suitable for a huge page allocation and a pointer
  * to the struct mempolicy for conditional unref after allocation.
@@ -2021,16 +2056,18 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask)
 {
 	int nid;
+	int mode;
 
 	*mpol = get_vma_policy(vma, addr);
-	*nodemask = NULL;	/* assume !MPOL_BIND */
+	*nodemask = NULL;
+	mode = (*mpol)->mode;
 
-	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
+	if (unlikely(mode == MPOL_INTERLEAVE)) {
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
 		nid = policy_node(gfp_flags, *mpol, numa_node_id());
-		if ((*mpol)->mode == MPOL_BIND)
+		if (mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY)
 			*nodemask = &(*mpol)->nodes;
 	}
 	return nid;
@@ -2063,6 +2100,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		*mask = mempolicy->nodes;
@@ -2173,7 +2211,7 @@ struct page *alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node and don't fall back to other nodes, as the cost of
 		 * remote accesses would likely offset THP benefits.
 		 *
-		 * If the policy is interleave, or does not allow the current
+		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
 		if (pol->mode == MPOL_PREFERRED)
@@ -2311,6 +2349,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2451,6 +2490,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		break;
 
 	case MPOL_PREFERRED:
+		if (node_isset(curnid, pol->nodes))
+			goto out;
 		polnid = first_node(pol->nodes);
 		break;
 
@@ -2465,9 +2506,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 				break;
 			goto out;
 		}
+		fallthrough;
 
+	case MPOL_PREFERRED_MANY:
 		/*
-		 * allows binding to multiple nodes.
 		 * use current page if in policy nodemask,
 		 * else select nearest allowed node, if any.
 		 * If no allowed nodes, use current [!misplaced].
@@ -2829,6 +2871,7 @@ static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
+	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
 
 
@@ -2907,6 +2950,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (!nodelist)
 			err = 0;
 		goto out;
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -2993,6 +3037,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_LOCAL:
 		break;
 	case MPOL_PREFERRED:
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		nodes = pol->nodes;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-08-02  8:11                       ` Feng Tang
@ 2021-08-02 11:14                         ` Michal Hocko
  2021-08-02 11:33                           ` Feng Tang
  0 siblings, 1 reply; 38+ messages in thread
From: Michal Hocko @ 2021-08-02 11:14 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, David Rientjes, Hansen, Dave, Widawsky,
	Ben, linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Williams, Dan J, Huang, Ying, Dave Hansen

On Mon 02-08-21 16:11:30, Feng Tang wrote:
> On Fri, Jul 30, 2021 at 03:18:40PM +0800, Tang, Feng wrote:
> [snip]
> > > > One thing is, it's possible that 'nd' is not set in the preferred
> > > > nodemask. 
> > > 
> > > Yes, and there shouldn't be any problem with that.  The given node is
> > > only used to get the respective zonelist (a distance-ordered list of
> > > zones to try). get_page_from_freelist will then use the preferred node
> > > mask to filter this zone list. Is that more clear now?
> > 
> > Yes, from the code, the policy_node() is always coupled with
> > policy_nodemask(), which secures the 'nodemask' limit. Thanks for
> > the clarification!
> 
> Hi Michal,
> 
> To ensure the nodemask limit, policy_nodemask() also needs some
> change to return the nodemask for the 'prefer-many' policy, so here is
> an updated 1/6 patch, which mainly changes the node/nodemask selection
> for the 'prefer-many' policy. Could you review it? Thanks!

Right, I had mixed it up with get_policy_nodemask.

> @@ -1875,8 +1897,13 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>   */
>  nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  {
> -	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> -	if (unlikely(policy->mode == MPOL_BIND) &&
> +	int mode = policy->mode;
> +
> +	/*
> +	 * Lower zones don't get a nodemask applied for 'bind' and
> +	 * 'prefer-many' policies
> +	 */
> +	if (unlikely(mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) &&
>  			apply_policy_zone(policy, gfp_zone(gfp)) &&
>  			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
>  		return &policy->nodes;

Isn't this just too cryptic? Why didn't you simply
	if (mode == MPOL_PREFERRED_MANY)
		return &policy->nodes;

in addition to the existing code? I mean why would you even care about
cpusets? Those are handled at the page allocator layer and will further
filter the given nodemask. 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-08-02 11:14                         ` Michal Hocko
@ 2021-08-02 11:33                           ` Feng Tang
  2021-08-02 11:47                             ` Michal Hocko
  0 siblings, 1 reply; 38+ messages in thread
From: Feng Tang @ 2021-08-02 11:33 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, David Rientjes, Hansen, Dave, Widawsky,
	Ben, linux-kernel, linux-api, Andrea Arcangeli, Mel Gorman,
	Mike Kravetz, Randy Dunlap, Vlastimil Babka, Andi Kleen,
	Williams, Dan J, Huang, Ying, Dave Hansen

On Mon, Aug 02, 2021 at 01:14:29PM +0200, Michal Hocko wrote:
> On Mon 02-08-21 16:11:30, Feng Tang wrote:
> > On Fri, Jul 30, 2021 at 03:18:40PM +0800, Tang, Feng wrote:
> > [snip]
> > > > > One thing is, it's possible that 'nd' is not set in the preferred
> > > > > nodemask. 
> > > > 
> > > > Yes, and there shouldn't be any problem with that.  The given node is
> > > > only used to get the respective zonelist (a distance-ordered list of
> > > > zones to try). get_page_from_freelist will then use the preferred node
> > > > mask to filter this zone list. Is that more clear now?
> > > 
> > > Yes, from the code, the policy_node() is always coupled with
> > > policy_nodemask(), which secures the 'nodemask' limit. Thanks for
> > > the clarification!
> > 
> > Hi Michal,
> > 
> > To ensure the nodemask limit, policy_nodemask() also needs some
> > change to return the nodemask for the 'prefer-many' policy, so here is
> > an updated 1/6 patch, which mainly changes the node/nodemask selection
> > for the 'prefer-many' policy. Could you review it? Thanks!
> 
> right, I have mixed it with get_policy_nodemask
> 
> > @@ -1875,8 +1897,13 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> >   */
> >  nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> >  {
> > -	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > -	if (unlikely(policy->mode == MPOL_BIND) &&
> > +	int mode = policy->mode;
> > +
> > +	/*
> > +	 * Lower zones don't get a nodemask applied for 'bind' and
> > +	 * 'prefer-many' policies
> > +	 */
> > +	if (unlikely(mode == MPOL_BIND || mode == MPOL_PREFERRED_MANY) &&
> >  			apply_policy_zone(policy, gfp_zone(gfp)) &&
> >  			cpuset_nodemask_valid_mems_allowed(&policy->nodes))
> >  		return &policy->nodes;
> 
> Isn't this just too cryptic? Why didn't you simply
> 	if (mode == MPOL_PREFERRED_MANY)
> 		return &policy->nodes;
> 
> in addition to the existing code? I mean why would you even care about
> cpusets? Those are handled at the page allocator layer and will further
> filter the given nodemask. 

Ok, I will follow your suggestion and keep 'bind' handling unchanged.

And to be honest, I don't fully understand the current handling of the
'bind' policy: will returning NULL for the 'bind' policy open a side
channel around the strict 'bind' limit?

Thanks,
Feng


> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2021-08-02 11:33                           ` Feng Tang
@ 2021-08-02 11:47                             ` Michal Hocko
  0 siblings, 0 replies; 38+ messages in thread
From: Michal Hocko @ 2021-08-02 11:47 UTC (permalink / raw)
  To: Feng Tang, Mel Gorman
  Cc: linux-mm, Andrew Morton, David Rientjes, Hansen, Dave, Widawsky,
	Ben, linux-kernel, linux-api, Andrea Arcangeli, Mike Kravetz,
	Randy Dunlap, Vlastimil Babka, Andi Kleen, Williams, Dan J,
	Huang, Ying, Dave Hansen

On Mon 02-08-21 19:33:26, Feng Tang wrote:
[...]
> And to be honest, I don't fully understand the current handling of the
> 'bind' policy: will returning NULL for the 'bind' policy open a side
> channel around the strict 'bind' limit?

I do not remember all the details, but this is an old behavior: the MBIND
policy doesn't apply to kernel allocations in the presence of the movable
zone. The detailed reasoning is not clear to me at the moment; maybe Mel
remembers?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2021-08-02 11:47 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-12  8:09 [PATCH v6 0/6] Introduce multi-preference mempolicy Feng Tang
2021-07-12  8:09 ` [PATCH v6 1/6] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Feng Tang
2021-07-28 12:31   ` Michal Hocko
2021-07-28 14:11     ` Feng Tang
2021-07-28 16:12       ` Michal Hocko
2021-07-29  7:09         ` Feng Tang
2021-07-29 13:38           ` Michal Hocko
2021-07-29 15:12             ` Feng Tang
2021-07-29 16:21               ` Michal Hocko
2021-07-30  3:05                 ` Feng Tang
2021-07-30  6:36                   ` Michal Hocko
2021-07-30  7:18                     ` Feng Tang
2021-07-30  7:38                       ` Michal Hocko
2021-08-02  8:11                       ` Feng Tang
2021-08-02 11:14                         ` Michal Hocko
2021-08-02 11:33                           ` Feng Tang
2021-08-02 11:47                             ` Michal Hocko
2021-07-12  8:09 ` [PATCH v6 2/6] mm/memplicy: add page allocation function for MPOL_PREFERRED_MANY policy Feng Tang
2021-07-28 12:42   ` Michal Hocko
2021-07-28 15:18     ` Feng Tang
2021-07-28 15:25       ` Feng Tang
2021-07-28 16:15         ` Michal Hocko
2021-07-28 16:14       ` Michal Hocko
2021-07-12  8:09 ` [PATCH v6 3/6] mm/mempolicy: enable page allocation for MPOL_PREFERRED_MANY for general cases Feng Tang
2021-07-12  8:09 ` [PATCH v6 4/6] mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY Feng Tang
2021-07-21 20:49   ` Mike Kravetz
2021-07-22  8:11     ` Feng Tang
2021-07-22  9:42     ` Michal Hocko
2021-07-22 16:21       ` Mike Kravetz
2021-07-12  8:09 ` [PATCH v6 5/6] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY Feng Tang
2021-07-28 12:47   ` Michal Hocko
2021-07-28 13:41     ` Feng Tang
2021-07-12  8:09 ` [PATCH v6 6/6] mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies Feng Tang
2021-07-28 12:51   ` Michal Hocko
2021-07-28 13:50     ` Feng Tang
2021-07-15  0:15 ` [PATCH v6 0/6] Introduce multi-preference mempolicy Andrew Morton
2021-07-15  2:13   ` Feng Tang
2021-07-15 18:49   ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).