linux-mm.kvack.org archive mirror
* [PATCH 00/18] multiple preferred nodes
@ 2020-06-19 16:24 Ben Widawsky
  2020-06-19 16:24 ` [PATCH 01/18] mm/mempolicy: Add comment for missing LOCAL Ben Widawsky
                   ` (19 more replies)
  0 siblings, 20 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Michal Hocko, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka

This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
preference for nodes which will fulfil memory allocation requests. Like the
MPOL_BIND interface, it works over a set of nodes.

Summary:
1-2: Random fixes I found along the way
3-4: Logic to handle many preferred nodes in page allocation
5-9: Plumbing to allow multiple preferred nodes in mempolicy
10-13: Teach page allocation APIs about nodemasks
14: Provide a helper to generate preferred nodemasks
15: Have page allocation callers generate preferred nodemasks
16-17: Flip the switch to have __alloc_pages_nodemask take preferred mask.
18: Expose the new uapi

Along with these patches are patches for libnuma, numactl, numademo, and memhog.
They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use cases for tiered memory usage
models, which I've lovingly named:
1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
requirements allowing preference to be given to all nodes with "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory (or
perhaps slow memory), but doesn't care which node it runs on. The application
can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
etc). This reverses how nodes are chosen today, where the kernel attempts to use
memory local to the CPU whenever possible; here the application instead uses
the accelerator local to the memory.
2. The Tortoise - The administrator (or the application itself) is aware it only
needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of nodes if
allocation fails on all nodes in the nodemask.

Like MPOL_BIND, a nodemask is given. Inherently, this removes ordering from the
preference.

> /* Set first two nodes as preferred in an 8 node system. */
> const unsigned long nodes = 0x3;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);

> /* Mimic interleave policy, but have fallback. */
> const unsigned long nodes = 0xaa;
> set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
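
For reference, below is a minimal compilable sketch of the above (not part of
the series). It assumes libnuma's <numaif.h> for the set_mempolicy(2) wrapper
(link with -lnuma), and the MPOL_PREFERRED_MANY define is only a placeholder
until the uapi header added by the last patch supplies the real value:

> #include <numaif.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> #ifndef MPOL_PREFERRED_MANY
> #define MPOL_PREFERRED_MANY 5	/* placeholder, use the new uapi value */
> #endif
>
> int main(void)
> {
> 	/* Prefer the first two nodes of an (up to) 8 node system. */
> 	const unsigned long nodes = 0x3;
>
> 	if (set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8)) {
> 		perror("set_mempolicy");
> 		return 1;
> 	}
>
> 	/* New allocations now prefer nodes 0-1 but may still fall back. */
> 	char *buf = malloc(1 << 20);
> 	memset(buf, 0, 1 << 20);	/* fault the pages in under the policy */
> 	free(buf);
> 	return 0;
> }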

Some internal discussion took place around the interface. There are two
alternatives which we have discussed, plus one I stuck in:
1. Ordered list of nodes. Currently it's believed that the added complexity is
   not needed for the expected use cases.
2. A flag for bind to allow falling back to other nodes. This confuses the
   notion of binding and is less flexible than the current solution.
3. Create flags or new modes that help with some ordering. This offers both a
   friendlier API as well as a solution for more customized usage. It's unknown
   if it's worth the complexity to support this. Here is sample code for how
   this might work:

> // Default
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
> // Prefer specific nodes for something wacky
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);

---

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>

Ben Widawsky (14):
  mm/mempolicy: Add comment for missing LOCAL
  mm/mempolicy: Use node_mem_id() instead of node_id()
  mm/page_alloc: start plumbing multi preferred node
  mm/page_alloc: add preferred pass to page allocation
  mm: Finish handling MPOL_PREFERRED_MANY
  mm: clean up alloc_pages_vma (thp)
  mm: Extract THP hugepage allocation
  mm/mempolicy: Use __alloc_page_node for interleaved
  mm: kill __alloc_pages
  mm/mempolicy: Introduce policy_preferred_nodes()
  mm: convert callers of __alloc_pages_nodemask to pmask
  alloc_pages_nodemask: turn preferred nid into a nodemask
  mm: Use less stack for page allocations
  mm/mempolicy: Advertise new MPOL_PREFERRED_MANY

Dave Hansen (4):
  mm/mempolicy: convert single preferred_node to full nodemask
  mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  mm/mempolicy: allow preferred code to take a nodemask
  mm/mempolicy: refactor rebind code for PREFERRED_MANY

 .../admin-guide/mm/numa_memory_policy.rst     |  22 +-
 include/linux/gfp.h                           |  19 +-
 include/linux/mempolicy.h                     |   4 +-
 include/linux/migrate.h                       |   4 +-
 include/linux/mmzone.h                        |   3 +
 include/uapi/linux/mempolicy.h                |   6 +-
 mm/hugetlb.c                                  |  10 +-
 mm/internal.h                                 |   1 +
 mm/mempolicy.c                                | 271 +++++++++++++-----
 mm/page_alloc.c                               | 179 +++++++++++-
 10 files changed, 403 insertions(+), 116 deletions(-)


-- 
2.27.0



^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 01/18] mm/mempolicy: Add comment for missing LOCAL
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-24  7:55   ` Michal Hocko
  2020-06-19 16:24 ` [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id() Ben Widawsky
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Ben Widawsky, Christoph Lameter, Andrew Morton, David Rientjes

MPOL_LOCAL is a bit weird because it is simply a different name for an
existing behavior (preferred policy with no node mask). It has been this
way since it was added here:
commit 479e2802d09f ("mm: mempolicy: Make MPOL_LOCAL a real policy")

It is so similar to MPOL_PREFERRED in fact that when the policy is
created in mpol_new, the mode is set as PREFERRED, and an internal state
representing LOCAL doesn't exist.

To prevent future explorers from scratching their heads as to why
MPOL_LOCAL isn't defined in the mpol_ops table, add a small comment
explaining the situation.
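
For reference, the relevant branch in mpol_new() currently looks roughly like
this (simplified excerpt of existing code, not something added by this patch):

	} else if (mode == MPOL_LOCAL) {
		if (!nodes_empty(*nodes) ||
		    (flags & MPOL_F_STATIC_NODES) ||
		    (flags & MPOL_F_RELATIVE_NODES))
			return ERR_PTR(-EINVAL);
		/* LOCAL is remapped here, so it never reaches mpol_ops[] */
		mode = MPOL_PREFERRED;
	}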

Cc: Christoph Lameter <cl@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 381320671677..36ee3267c25f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -427,6 +427,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_bind,
 		.rebind = mpol_rebind_nodemask,
 	},
+	/* MPOL_LOCAL is converted to MPOL_PREFERRED on policy creation */
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id()
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
  2020-06-19 16:24 ` [PATCH 01/18] mm/mempolicy: Add comment for missing LOCAL Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-24  8:25   ` Michal Hocko
  2020-06-19 16:24 ` [PATCH 03/18] mm/page_alloc: start plumbing multi preferred node Ben Widawsky
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Ben Widawsky, Andrew Morton, Lee Schermerhorn

Calling out some distinctions first as I understand them, along with the
reasoning for the patch:
numa_node_id() - The node id for the currently running CPU.
numa_mem_id() - The node id for the closest memory node.

The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
Only ia64 and powerpc support this option, so it is perhaps not a very
interesting situation to most.

The question is, when do you want which? numa_node_id() is definitely
what's desired if MPOL_PREFERRED or MPOL_LOCAL were used, since the ABI
states "This mode specifies "local allocation"; the memory is allocated
on the node of the CPU that triggered the allocation (the "local
node")." It would be weird, though not impossible, to set this policy on
a CPU that has memoryless nodes. A more likely way to hit this is with
interleaving. The current interfaces will return some equally weird
thing, but at least it's symmetric. Therefore, in cases where the node
is being queried for the currently running process, it probably makes
sense to use numa_node_id(). For other cases, however, when the CPU is
trying to obtain the "local" memory, numa_mem_id() already accounts for
this and should be used instead.

This really should only affect configurations where
CONFIG_HAVE_MEMORYLESS_NODES=y, and even on those machines it's quite
possible the ultimate behavior would be identical.
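
For reference, the two helpers are defined roughly as follows in
include/linux/topology.h (simplified sketch; the per-CPU plumbing and the
config guards around numa_node_id() are elided):

static inline int numa_node_id(void)
{
	return raw_cpu_read(numa_node);		/* node of the running CPU */
}

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
static inline int numa_mem_id(void)
{
	return raw_cpu_read(_numa_mem_);	/* nearest node with memory */
}
#else
static inline int numa_mem_id(void)
{
	return numa_node_id();			/* same when every node has memory */
}
#endif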

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36ee3267c25f..99e0f3f9c4a6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1991,7 +1991,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 	int nid;
 
 	if (!nnodes)
-		return numa_node_id();
+		return numa_mem_id();
 	target = (unsigned int)n % nnodes;
 	nid = first_node(pol->v.nodes);
 	for (i = 0; i < target; i++)
@@ -2049,7 +2049,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
-		nid = policy_node(gfp_flags, *mpol, numa_node_id());
+		nid = policy_node(gfp_flags, *mpol, numa_mem_id());
 		if ((*mpol)->mode == MPOL_BIND)
 			*nodemask = &(*mpol)->v.nodes;
 	}
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 03/18] mm/page_alloc: start plumbing multi preferred node
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
  2020-06-19 16:24 ` [PATCH 01/18] mm/mempolicy: Add comment for missing LOCAL Ben Widawsky
  2020-06-19 16:24 ` [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id() Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 04/18] mm/page_alloc: add preferred pass to page allocation Ben Widawsky
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andi Kleen, Andrew Morton, Dave Hansen,
	Kuppuswamy Sathyanarayanan, Mel Gorman, Michal Hocko

In preparation for supporting multiple preferred nodes, we need the
internals to switch from taking a nid to a nodemask.

As mentioned in the code as a comment, __alloc_pages_nodemask() is the
heart of the page allocator. It takes a single node as a preferred node
to try to obtain a zonelist from first. This patch leaves that internal
interface in place, but changes the guts of the function to consider a
list of preferred nodes.

The local node is always most preferred. If the local node is either
restricted because of preference or binding, then the closest node that
meets both the binding and preference criteria is used. If the
intersection of binding and preference is the empty set, then fall back
to the first node that meets the binding criteria.

As of this patch, multiple preferred nodes aren't actually supported,
though it might initially seem so. As an example, suppose your preferred
nodes are 0 and 1. Node 0's fallback zone list may have zones from nodes
ordered 0->2->1. If this code were to pick 0's zonelist, and all zones
from node 0 were full, you'd get a zone from node 2 instead of 1. As
multiple nodes aren't yet supported anyway, this is okay as a
preparatory patch.

v2:
Fixed memory hotplug handling (Ben)

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/page_alloc.c | 125 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 119 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eb0f1410d4..280ca85dc4d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -129,6 +129,10 @@ nodemask_t node_states[NR_NODE_STATES] __read_mostly = {
 };
 EXPORT_SYMBOL(node_states);
 
+#ifdef CONFIG_NUMA
+static int find_next_best_node(int node, nodemask_t *used_node_mask);
+#endif
+
 atomic_long_t _totalram_pages __read_mostly;
 EXPORT_SYMBOL(_totalram_pages);
 unsigned long totalreserve_pages __read_mostly;
@@ -4759,13 +4763,118 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
-		int preferred_nid, nodemask_t *nodemask,
-		struct alloc_context *ac, gfp_t *alloc_mask,
-		unsigned int *alloc_flags)
+#ifndef CONFIG_NUMA
+#define set_pref_bind_mask(out, pref, bind)                                    \
+	{                                                                      \
+		(out)->bits[0] = 1UL                                           \
+	}
+#else
+static void set_pref_bind_mask(nodemask_t *out, const nodemask_t *prefmask,
+			       const nodemask_t *bindmask)
+{
+	bool has_pref, has_bind;
+
+	has_pref = prefmask && !nodes_empty(*prefmask);
+	has_bind = bindmask && !nodes_empty(*bindmask);
+
+	if (has_pref && has_bind)
+		nodes_and(*out, *prefmask, *bindmask);
+	else if (has_pref && !has_bind)
+		*out = *prefmask;
+	else if (!has_pref && has_bind)
+		*out = *bindmask;
+	else if (!has_pref && !has_bind)
+		unreachable(); /* Handled above */
+	else
+		unreachable();
+}
+#endif
+
+/*
+ * Find a zonelist from a preferred node. Here is a truth table example using 2
+ * different masks. The choices are, NULL mask, empty mask, two masks with an
+ * intersection, and two masks with no intersection. If the local node is in the
+ * intersection, it is used, otherwise the first set node is used.
+ *
+ * +----------+----------+------------+
+ * | bindmask | prefmask |  zonelist  |
+ * +----------+----------+------------+
+ * | NULL/0   | NULL/0   | local node |
+ * | NULL/0   | 0x2      | 0x2        |
+ * | NULL/0   | 0x4      | 0x4        |
+ * | 0x2      | NULL/0   | 0x2        |
+ * | 0x2      | 0x2      | 0x2        |
+ * | 0x2      | 0x4      | local*     |
+ * | 0x4      | NULL/0   | 0x4        |
+ * | 0x4      | 0x2      | local*     |
+ * | 0x4      | 0x4      | 0x4        |
+ * +----------+----------+------------+
+ *
+ * NB: That zonelist will have *all* zones in the fallback case, and not all of
+ * those zones will belong to preferred nodes.
+ */
+static struct zonelist *preferred_zonelist(gfp_t gfp_mask,
+					   const nodemask_t *prefmask,
+					   const nodemask_t *bindmask)
+{
+	nodemask_t pref;
+	int nid, local_node = numa_mem_id();
+
+	/* Multi nodes not supported yet */
+	VM_BUG_ON(prefmask && nodes_weight(*prefmask) != 1);
+
+#define _isset(mask, node)                                                     \
+	(!(mask) || nodes_empty(*(mask)) ? 1 : node_isset(node, *(mask)))
+	/*
+	 * This will handle NULL masks, empty masks, and when the local node
+	 * match all constraints. It does most of the magic here.
+	 */
+	if (_isset(prefmask, local_node) && _isset(bindmask, local_node))
+		return node_zonelist(local_node, gfp_mask);
+#undef _isset
+
+	VM_BUG_ON(!prefmask && !bindmask);
+
+	set_pref_bind_mask(&pref, prefmask, bindmask);
+
+	/*
+	 * It is possible that the caller may ask for a preferred set that isn't
+	 * available. One such case is memory hotplug. Memory hotplug code tries
+	 * to do some allocations from the target node (what will be local to
+	 * the new node) before the pages are onlined (N_MEMORY).
+	 */
+	for_each_node_mask(nid, pref) {
+		if (!node_state(nid, N_MEMORY))
+			node_clear(nid, pref);
+	}
+
+	/*
+	 * If we couldn't manage to get anything reasonable, let later code
+	 * clean up our mess. Local node will be the best approximation for what
+	 * is desired, just use it.
+	 */
+	if (unlikely(nodes_empty(pref)))
+		return node_zonelist(local_node, gfp_mask);
+
+	/* Try to find the "closest" node in the list. */
+	nodes_complement(pref, pref);
+	while ((nid = find_next_best_node(local_node, &pref)) != NUMA_NO_NODE)
+		return node_zonelist(nid, gfp_mask);
+
+	/*
+	 * find_next_best_node() will have to have found something if the
+	 * node list isn't empty and so it isn't possible to get here unless
+	 * find_next_best_node() is modified and this function isn't updated.
+	 */
+	BUG();
+}
+
+static inline bool
+prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, nodemask_t *prefmask,
+		    nodemask_t *nodemask, struct alloc_context *ac,
+		    gfp_t *alloc_mask, unsigned int *alloc_flags)
 {
 	ac->highest_zoneidx = gfp_zone(gfp_mask);
-	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
 	ac->migratetype = gfp_migratetype(gfp_mask);
 
@@ -4777,6 +4886,8 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 			*alloc_flags |= ALLOC_CPUSET;
 	}
 
+	ac->zonelist = preferred_zonelist(gfp_mask, prefmask, ac->nodemask);
+
 	fs_reclaim_acquire(gfp_mask);
 	fs_reclaim_release(gfp_mask);
 
@@ -4817,6 +4928,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
+	nodemask_t prefmask = nodemask_of_node(preferred_nid);
 
 	/*
 	 * There are several places where we assume that the order value is sane
@@ -4829,7 +4941,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 	gfp_mask &= gfp_allowed_mask;
 	alloc_mask = gfp_mask;
-	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+	if (!prepare_alloc_pages(gfp_mask, order, &prefmask, nodemask, &ac,
+				 &alloc_mask, &alloc_flags))
 		return NULL;
 
 	finalise_ac(gfp_mask, &ac);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 04/18] mm/page_alloc: add preferred pass to page allocation
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (2 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 03/18] mm/page_alloc: start plumbing multi preferred node Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 05/18] mm/mempolicy: convert single preferred_node to full nodemask Ben Widawsky
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andi Kleen, Andrew Morton, Dave Hansen,
	Johannes Weiner, Mel Gorman, Michal Hocko, Vlastimil Babka

This patch updates the core part of page allocation (pulling from the
free list) to take preferred nodes into account first. If an allocation
from a preferred node cannot be found, the remaining nodes in the
nodemask are checked.

Intentionally not handled in this patch are OOM node scanning and
reclaim scanning. I am very open to comments on whether it is worth
handling those cases with a preferred node ordering.

In this patch the code first scans the preferred nodes to make the
allocation, and then falls back to the subset of remaining bound nodes
(often this is NULL, i.e. all nodes) - potentially two passes. The code
was already two passes, as it tries not to fragment on the first pass,
so now it's up to four passes (a rough sketch follows the example
below).

Consider a 3 node system (0-2) passed the following masks:
Preferred: 	100
Bound:		110

pass 1: node 2 no fragmentation
pass 2: node 1 no fragmentation
pass 3: node 2 w/fragmentation
pass 4: node 1 w/fragmentation
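
The following userspace toy model (not kernel code) sketches how the two masks
combine into the four passes; nodes are bits in an unsigned long rather than a
nodemask_t, and the helpers mirror the set_pref_bind_mask() and
__nodemask_for_freelist_scan() logic added below:

#include <stdio.h>

static unsigned long pref_bind_mask(unsigned long pref, unsigned long bind)
{
	if (pref && bind)
		return pref & bind;		/* intersection when both given */
	return pref ? pref : (bind ? bind : ~0UL);
}

static unsigned long fallback_mask(unsigned long pref, unsigned long bind)
{
	if (!pref)				/* everything allowed was already tried */
		return 0;
	return bind ? (bind & ~pref) : ~pref;	/* bound-but-not-preferred nodes */
}

int main(void)
{
	unsigned long pref = 0x4, bind = 0x6;	/* preferred: node 2, bound: nodes 1-2 */
	unsigned long masks[2] = {
		pref_bind_mask(pref, bind),	/* pass over preferred nodes */
		fallback_mask(pref, bind),	/* pass over remaining bound nodes */
	};
	const char *frag[2] = { "no fragmentation", "w/fragmentation" };

	for (int f = 0, pass = 1; f < 2; f++)
		for (int m = 0; m < 2; m++)
			printf("pass %d: nodes 0x%lx (%s)\n",
			       pass++, masks[m], frag[f]);
	return 0;
}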

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/internal.h   |   1 +
 mm/page_alloc.c | 108 +++++++++++++++++++++++++++++++++++-------------
 2 files changed, 80 insertions(+), 29 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9886db20d94f..8d16229c6cbb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -138,6 +138,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
 struct alloc_context {
 	struct zonelist *zonelist;
 	nodemask_t *nodemask;
+	nodemask_t *prefmask;
 	struct zoneref *preferred_zoneref;
 	int migratetype;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 280ca85dc4d8..3cf44b6c31ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3675,6 +3675,69 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 	return alloc_flags;
 }
 
+#ifdef CONFIG_NUMA
+static void set_pref_bind_mask(nodemask_t *out, const nodemask_t *prefmask,
+			       const nodemask_t *bindmask)
+{
+	bool has_pref, has_bind;
+
+	has_pref = prefmask && !nodes_empty(*prefmask);
+	has_bind = bindmask && !nodes_empty(*bindmask);
+
+	if (has_pref && has_bind)
+		nodes_and(*out, *prefmask, *bindmask);
+	else if (has_pref && !has_bind)
+		*out = *prefmask;
+	else if (!has_pref && has_bind)
+		*out = *bindmask;
+	else if (!has_pref && !has_bind)
+		*out = NODE_MASK_ALL;
+	else
+		unreachable();
+}
+#else
+#define set_pref_bind_mask(out, pref, bind)                                    \
+	{                                                                      \
+		(out)->bits[0] = 1UL                                           \
+	}
+#endif
+
+/* Helper to generate the preferred and fallback nodelists */
+static void __nodemask_for_freelist_scan(const struct alloc_context *ac,
+					 bool preferred, nodemask_t *outnodes)
+{
+	bool has_pref;
+	bool has_bind;
+
+	if (preferred) {
+		set_pref_bind_mask(outnodes, ac->prefmask, ac->nodemask);
+		return;
+	}
+
+	has_pref = ac->prefmask && !nodes_empty(*ac->prefmask);
+	has_bind = ac->nodemask && !nodes_empty(*ac->nodemask);
+
+	if (!has_bind && !has_pref) {
+		/*
+		 * If no preference, we already tried the full nodemask,
+		 * so we have to bail.
+		 */
+		nodes_clear(*outnodes);
+	} else if (!has_bind && has_pref) {
+		/* We tried preferred nodes only before. Invert that. */
+		nodes_complement(*outnodes, *ac->prefmask);
+	} else if (has_bind && !has_pref) {
+		/*
+		 * If preferred was empty, we've tried all bound nodes,
+		 * and there nothing further we can do.
+		 */
+		nodes_clear(*outnodes);
+	} else if (has_bind && has_pref) {
+		/* Try the bound nodes that weren't tried before. */
+		nodes_andnot(*outnodes, *ac->nodemask, *ac->prefmask);
+	}
+}
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3686,7 +3749,10 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
-	bool no_fallback;
+	nodemask_t nodes;
+	bool no_fallback, preferred_nodes_exhausted = false;
+
+	__nodemask_for_freelist_scan(ac, true, &nodes);
 
 retry:
 	/*
@@ -3696,7 +3762,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
 	z = ac->preferred_zoneref;
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist,
-					ac->highest_zoneidx, ac->nodemask) {
+					ac->highest_zoneidx, &nodes)
+	{
 		struct page *page;
 		unsigned long mark;
 
@@ -3816,12 +3883,20 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 	}
 
+	if (!preferred_nodes_exhausted) {
+		__nodemask_for_freelist_scan(ac, false, &nodes);
+		preferred_nodes_exhausted = true;
+		goto retry;
+	}
+
 	/*
 	 * It's possible on a UMA machine to get through all zones that are
 	 * fragmented. If avoiding fragmentation, reset and try again.
 	 */
 	if (no_fallback) {
 		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		__nodemask_for_freelist_scan(ac, true, &nodes);
+		preferred_nodes_exhausted = false;
 		goto retry;
 	}
 
@@ -4763,33 +4838,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-#ifndef CONFIG_NUMA
-#define set_pref_bind_mask(out, pref, bind)                                    \
-	{                                                                      \
-		(out)->bits[0] = 1UL                                           \
-	}
-#else
-static void set_pref_bind_mask(nodemask_t *out, const nodemask_t *prefmask,
-			       const nodemask_t *bindmask)
-{
-	bool has_pref, has_bind;
-
-	has_pref = prefmask && !nodes_empty(*prefmask);
-	has_bind = bindmask && !nodes_empty(*bindmask);
-
-	if (has_pref && has_bind)
-		nodes_and(*out, *prefmask, *bindmask);
-	else if (has_pref && !has_bind)
-		*out = *prefmask;
-	else if (!has_pref && has_bind)
-		*out = *bindmask;
-	else if (!has_pref && !has_bind)
-		unreachable(); /* Handled above */
-	else
-		unreachable();
-}
-#endif
-
 /*
  * Find a zonelist from a preferred node. Here is a truth table example using 2
  * different masks. The choices are, NULL mask, empty mask, two masks with an
@@ -4945,6 +4993,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 				 &alloc_mask, &alloc_flags))
 		return NULL;
 
+	ac.prefmask = &prefmask;
+
 	finalise_ac(gfp_mask, &ac);
 
 	/*
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 05/18] mm/mempolicy: convert single preferred_node to full nodemask
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (3 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 04/18] mm/page_alloc: add preferred pass to page allocation Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 06/18] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Ben Widawsky
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Dave Hansen, Andrew Morton, Ben Widawsky

From: Dave Hansen <dave.hansen@linux.intel.com>

The NUMA APIs currently allow passing in a "preferred node" as a
single bit set in a nodemask.  If more than one bit is set, bits
after the first are ignored.  Internally, this is implemented as
a single integer: mempolicy->preferred_node.

This single node is generally OK for location-based NUMA where
memory being allocated will eventually be operated on by a single
CPU.  However, in systems with multiple memory types, folks want
to target a *type* of memory instead of a location.  For instance,
someone might want some high-bandwidth memory but not care about
the CPU next to which it is allocated.  Or, they want a cheap,
high capacity allocation and want to target all NUMA nodes which
have persistent memory in volatile mode.  In both of these cases,
the application wants to target a *set* of nodes, but does not
want strict MPOL_BIND behavior.

To get that behavior, a MPOL_PREFERRED mode is desirable, but one
that honors multiple nodes to be set in the nodemask.

The first step in that direction is to be able to internally store
multiple preferred nodes, which is implemented in this patch.

This should not result in any functional changes and just switches the
internal representation of mempolicy->preferred_node from an
integer to a nodemask called 'mempolicy->preferred_nodes'.

This is not a pie-in-the-sky dream for an API.  This was a response to a
specific ask of more than one group at Intel.  Specifically:

1. There are existing libraries that target memory types such as
   https://github.com/memkind/memkind.  These are known to suffer
   from SIGSEGV's when memory is low on targeted memory "kinds" that
   span more than one node.  The MCDRAM on a Xeon Phi in "Cluster on
   Die" mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
   which is targeted at either "cheap and slow" (PMEM) or "expensive and
   fast" (DRAM).  However, they do not want to experience allocation
   failures when the targeted type is unavailable.
3. Allocate-then-run.  Generally, we let the process scheduler decide
   on which physical CPU to run a task.  That location provides a
   default allocation policy, and memory availability is not generally
   considered when placing tasks.  For situations where memory is
   valuable and constrained, some users want to allocate memory first,
   *then* allocate close compute resources to the allocation.  This is
   the reverse of the normal (CPU) model.  Accelerators such as GPUs
   that operate on core-mm-managed memory are interested in this model.

v2:
Fix spelling errors in commit message. (Ben)
clang-format. (Ben)
Integrated bit from another patch. (Ben)
Update the docs to reflect the internal data structure change (Ben)
Don't advertise MPOL_PREFERRED_MANY in UAPI until we can handle it (Ben)
Added more to the commit message (Dave)

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> (v2)
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  6 +--
 include/linux/mempolicy.h                     |  4 +-
 mm/mempolicy.c                                | 40 ++++++++++---------
 3 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 067a90a1499c..1ad020c459b8 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -205,9 +205,9 @@ MPOL_PREFERRED
 	of increasing distance from the preferred node based on
 	information provided by the platform firmware.
 
-	Internally, the Preferred policy uses a single node--the
-	preferred_node member of struct mempolicy.  When the internal
-	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
+	Internally, the Preferred policy uses a nodemask--the
+	preferred_nodes member of struct mempolicy.  When the internal
+	mode flag MPOL_F_LOCAL is set, the preferred_nodes are ignored
 	and the policy is interpreted as local allocation.  "Local"
 	allocation policy can be viewed as a Preferred policy that
 	starts at the node containing the cpu where the allocation
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ea9c15b60a96..c66ea9f4c61e 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -47,8 +47,8 @@ struct mempolicy {
 	unsigned short mode; 	/* See MPOL_* above */
 	unsigned short flags;	/* See set_mempolicy() MPOL_F_* above */
 	union {
-		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave/bind */
+		nodemask_t preferred_nodes; /* preferred */
+		nodemask_t nodes; /* interleave/bind */
 		/* undefined for default */
 	} v;
 	union {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 99e0f3f9c4a6..e0b576838e57 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -205,7 +205,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
 	else if (nodes_empty(*nodes))
 		return -EINVAL;			/*  no allowed nodes */
 	else
-		pol->v.preferred_node = first_node(*nodes);
+		pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
 	return 0;
 }
 
@@ -345,22 +345,26 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
 						const nodemask_t *nodes)
 {
 	nodemask_t tmp;
+	nodemask_t preferred_node;
+
+	/* MPOL_PREFERRED uses only the first node in the mask */
+	preferred_node = nodemask_of_node(first_node(*nodes));
 
 	if (pol->flags & MPOL_F_STATIC_NODES) {
 		int node = first_node(pol->w.user_nodemask);
 
 		if (node_isset(node, *nodes)) {
-			pol->v.preferred_node = node;
+			pol->v.preferred_nodes = nodemask_of_node(node);
 			pol->flags &= ~MPOL_F_LOCAL;
 		} else
 			pol->flags |= MPOL_F_LOCAL;
 	} else if (pol->flags & MPOL_F_RELATIVE_NODES) {
 		mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
-		pol->v.preferred_node = first_node(tmp);
+		pol->v.preferred_nodes = tmp;
 	} else if (!(pol->flags & MPOL_F_LOCAL)) {
-		pol->v.preferred_node = node_remap(pol->v.preferred_node,
-						   pol->w.cpuset_mems_allowed,
-						   *nodes);
+		nodes_remap(tmp, pol->v.preferred_nodes,
+			    pol->w.cpuset_mems_allowed, preferred_node);
+		pol->v.preferred_nodes = tmp;
 		pol->w.cpuset_mems_allowed = *nodes;
 	}
 }
@@ -913,7 +917,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 		break;
 	case MPOL_PREFERRED:
 		if (!(p->flags & MPOL_F_LOCAL))
-			node_set(p->v.preferred_node, *nodes);
+			*nodes = p->v.preferred_nodes;
 		/* else return empty node mask for local allocation */
 		break;
 	default:
@@ -1906,9 +1910,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 static int policy_node(gfp_t gfp, struct mempolicy *policy,
 								int nd)
 {
-	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
-		nd = policy->v.preferred_node;
-	else {
+	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) {
+		nd = first_node(policy->v.preferred_nodes);
+	} else {
 		/*
 		 * __GFP_THISNODE shouldn't even be used with the bind policy
 		 * because we might easily break the expectation to stay on the
@@ -1953,7 +1957,7 @@ unsigned int mempolicy_slab_node(void)
 		/*
 		 * handled MPOL_F_LOCAL above
 		 */
-		return policy->v.preferred_node;
+		return first_node(policy->v.preferred_nodes);
 
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
@@ -2087,7 +2091,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 		if (mempolicy->flags & MPOL_F_LOCAL)
 			nid = numa_node_id();
 		else
-			nid = mempolicy->v.preferred_node;
+			nid = first_node(mempolicy->v.preferred_nodes);
 		init_nodemask_of_node(mask, nid);
 		break;
 
@@ -2225,7 +2229,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node in its nodemask, we allocate the standard way.
 		 */
 		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
-			hpage_node = pol->v.preferred_node;
+			hpage_node = first_node(pol->v.preferred_nodes);
 
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
@@ -2364,7 +2368,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 		/* a's ->flags is the same as b's */
 		if (a->flags & MPOL_F_LOCAL)
 			return true;
-		return a->v.preferred_node == b->v.preferred_node;
+		return nodes_equal(a->v.preferred_nodes, b->v.preferred_nodes);
 	default:
 		BUG();
 		return false;
@@ -2508,7 +2512,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		if (pol->flags & MPOL_F_LOCAL)
 			polnid = numa_node_id();
 		else
-			polnid = pol->v.preferred_node;
+			polnid = first_node(pol->v.preferred_nodes);
 		break;
 
 	case MPOL_BIND:
@@ -2825,7 +2829,7 @@ void __init numa_policy_init(void)
 			.refcnt = ATOMIC_INIT(1),
 			.mode = MPOL_PREFERRED,
 			.flags = MPOL_F_MOF | MPOL_F_MORON,
-			.v = { .preferred_node = nid, },
+			.v = { .preferred_nodes = nodemask_of_node(nid), },
 		};
 	}
 
@@ -2991,7 +2995,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 	if (mode != MPOL_PREFERRED)
 		new->v.nodes = nodes;
 	else if (nodelist)
-		new->v.preferred_node = first_node(nodes);
+		new->v.preferred_nodes = nodemask_of_node(first_node(nodes));
 	else
 		new->flags |= MPOL_F_LOCAL;
 
@@ -3044,7 +3048,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		if (flags & MPOL_F_LOCAL)
 			mode = MPOL_LOCAL;
 		else
-			node_set(pol->v.preferred_node, nodes);
+			nodes_or(nodes, nodes, pol->v.preferred_nodes);
 		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 06/18] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (4 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 05/18] mm/mempolicy: convert single preferred_node to full nodemask Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 07/18] mm/mempolicy: allow preferred code to take a nodemask Ben Widawsky
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Dave Hansen, Andrew Morton, Ben Widawsky

From: Dave Hansen <dave.hansen@linux.intel.com>

MPOL_PREFERRED honors only a single node set in the nodemask.  Add the
bare define for a new mode which will allow more than one.

The patch does all the plumbing without actually adding the new policy
type.

v2:
Plumb most MPOL_PREFERRED_MANY without exposing UAPI (Ben)
Fixes for checkpatch (Ben)

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 42 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e0b576838e57..6c7301cefeb6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -31,6 +31,9 @@
  *                but useful to set in a VMA when you have a non default
  *                process policy.
  *
+ * preferred many Try a set of nodes first before normal fallback. This is
+ *                similar to preferred without the special case.
+ *
  * default        Allocate on the local node first, or when on a VMA
  *                use the process policy. This is what Linux always did
  *		  in a NUMA aware kernel and still does by, ahem, default.
@@ -105,6 +108,8 @@
 
 #include "internal.h"
 
+#define MPOL_PREFERRED_MANY MPOL_MAX
+
 /* Internal flags */
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
@@ -175,7 +180,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
-} mpol_ops[MPOL_MAX];
+} mpol_ops[MPOL_MAX + 1];
 
 static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
 {
@@ -415,7 +420,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	mmap_write_unlock(mm);
 }
 
-static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
+static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
 	[MPOL_DEFAULT] = {
 		.rebind = mpol_rebind_default,
 	},
@@ -432,6 +437,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.rebind = mpol_rebind_nodemask,
 	},
 	/* MPOL_LOCAL is converted to MPOL_PREFERRED on policy creation */
+	[MPOL_PREFERRED_MANY] = {
+		.create = NULL,
+		.rebind = NULL,
+	},
 };
 
 static int migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -915,6 +924,9 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
+	case MPOL_PREFERRED_MANY:
+		*nodes = p->v.preferred_nodes;
+		break;
 	case MPOL_PREFERRED:
 		if (!(p->flags & MPOL_F_LOCAL))
 			*nodes = p->v.preferred_nodes;
@@ -1910,7 +1922,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 static int policy_node(gfp_t gfp, struct mempolicy *policy,
 								int nd)
 {
-	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL)) {
+	if ((policy->mode == MPOL_PREFERRED ||
+	     policy->mode == MPOL_PREFERRED_MANY) &&
+	    !(policy->flags & MPOL_F_LOCAL)) {
 		nd = first_node(policy->v.preferred_nodes);
 	} else {
 		/*
@@ -1953,6 +1967,7 @@ unsigned int mempolicy_slab_node(void)
 		return node;
 
 	switch (policy->mode) {
+	case MPOL_PREFERRED_MANY:
 	case MPOL_PREFERRED:
 		/*
 		 * handled MPOL_F_LOCAL above
@@ -2087,6 +2102,9 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	task_lock(current);
 	mempolicy = current->mempolicy;
 	switch (mempolicy->mode) {
+	case MPOL_PREFERRED_MANY:
+		*mask = mempolicy->v.preferred_nodes;
+		break;
 	case MPOL_PREFERRED:
 		if (mempolicy->flags & MPOL_F_LOCAL)
 			nid = numa_node_id();
@@ -2141,6 +2159,9 @@ bool mempolicy_nodemask_intersects(struct task_struct *tsk,
 		 * nodes in mask.
 		 */
 		break;
+	case MPOL_PREFERRED_MANY:
+		ret = nodes_intersects(mempolicy->v.preferred_nodes, *mask);
+		break;
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		ret = nodes_intersects(mempolicy->v.nodes, *mask);
@@ -2225,8 +2246,9 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		 * node and don't fall back to other nodes, as the cost of
 		 * remote accesses would likely offset THP benefits.
 		 *
-		 * If the policy is interleave, or does not allow the current
-		 * node in its nodemask, we allocate the standard way.
+		 * If the policy is interleave or multiple preferred nodes, or
+		 * does not allow the current node in its nodemask, we allocate
+		 * the standard way.
 		 */
 		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
 			hpage_node = first_node(pol->v.preferred_nodes);
@@ -2364,6 +2386,9 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
 		return !!nodes_equal(a->v.nodes, b->v.nodes);
+	case MPOL_PREFERRED_MANY:
+		return !!nodes_equal(a->v.preferred_nodes,
+				     b->v.preferred_nodes);
 	case MPOL_PREFERRED:
 		/* a's ->flags is the same as b's */
 		if (a->flags & MPOL_F_LOCAL)
@@ -2532,6 +2557,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		polnid = zone_to_nid(z->zone);
 		break;
 
+	/* case MPOL_PREFERRED_MANY: */
+
 	default:
 		BUG();
 	}
@@ -2883,6 +2910,7 @@ static const char * const policy_modes[] =
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
 	[MPOL_LOCAL]      = "local",
+	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
 
 
@@ -2962,6 +2990,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		if (!nodelist)
 			err = 0;
 		goto out;
+	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 		/*
 		 * Insist on a nodelist
@@ -3044,6 +3073,9 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	switch (mode) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_PREFERRED_MANY:
+		WARN_ON(flags & MPOL_F_LOCAL);
+		fallthrough;
 	case MPOL_PREFERRED:
 		if (flags & MPOL_F_LOCAL)
 			mode = MPOL_LOCAL;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 07/18] mm/mempolicy: allow preferred code to take a nodemask
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (5 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 06/18] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 08/18] mm/mempolicy: refactor rebind code for PREFERRED_MANY Ben Widawsky
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Dave Hansen, Andrew Morton, Ben Widawsky

From: Dave Hansen <dave.hansen@linux.intel.com>

Create a helper function (mpol_new_preferred_many()) which is usable
both by the old, single-node MPOL_PREFERRED and the new
MPOL_PREFERRED_MANY.

Enforce the old single-node MPOL_PREFERRED behavior in the "new"
version of mpol_new_preferred() which calls mpol_new_preferred_many().

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6c7301cefeb6..541675a5b947 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -203,17 +203,30 @@ static int mpol_new_interleave(struct mempolicy *pol, const nodemask_t *nodes)
 	return 0;
 }
 
-static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
+static int mpol_new_preferred_many(struct mempolicy *pol,
+				   const nodemask_t *nodes)
 {
 	if (!nodes)
 		pol->flags |= MPOL_F_LOCAL;	/* local allocation */
 	else if (nodes_empty(*nodes))
 		return -EINVAL;			/*  no allowed nodes */
 	else
-		pol->v.preferred_nodes = nodemask_of_node(first_node(*nodes));
+		pol->v.preferred_nodes = *nodes;
 	return 0;
 }
 
+static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
+{
+	if (nodes) {
+		/* MPOL_PREFERRED can only take a single node: */
+		nodemask_t tmp = nodemask_of_node(first_node(*nodes));
+
+		return mpol_new_preferred_many(pol, &tmp);
+	}
+
+	return mpol_new_preferred_many(pol, NULL);
+}
+
 static int mpol_new_bind(struct mempolicy *pol, const nodemask_t *nodes)
 {
 	if (nodes_empty(*nodes))
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 08/18] mm/mempolicy: refactor rebind code for PREFERRED_MANY
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (6 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 07/18] mm/mempolicy: allow preferred code to take a nodemask Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 09/18] mm: Finish handling MPOL_PREFERRED_MANY Ben Widawsky
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Dave Hansen, Ben Widawsky

From: Dave Hansen <dave.hansen@linux.intel.com>

Again, this extracts the "only one node must be set" behavior of
MPOL_PREFERRED.  It retains virtually all of the existing code so it can
be used by MPOL_PREFERRED_MANY as well.

v2:
Fixed typos in commit message. (Ben)
Merged bits from other patches. (Ben)
annotate mpol_rebind_preferred_many as unused (Ben)

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 541675a5b947..bfc4ef2af90d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -359,14 +359,11 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 	pol->v.nodes = tmp;
 }
 
-static void mpol_rebind_preferred(struct mempolicy *pol,
-						const nodemask_t *nodes)
+static void mpol_rebind_preferred_common(struct mempolicy *pol,
+					 const nodemask_t *preferred_nodes,
+					 const nodemask_t *nodes)
 {
 	nodemask_t tmp;
-	nodemask_t preferred_node;
-
-	/* MPOL_PREFERRED uses only the first node in the mask */
-	preferred_node = nodemask_of_node(first_node(*nodes));
 
 	if (pol->flags & MPOL_F_STATIC_NODES) {
 		int node = first_node(pol->w.user_nodemask);
@@ -381,12 +378,30 @@ static void mpol_rebind_preferred(struct mempolicy *pol,
 		pol->v.preferred_nodes = tmp;
 	} else if (!(pol->flags & MPOL_F_LOCAL)) {
 		nodes_remap(tmp, pol->v.preferred_nodes,
-			    pol->w.cpuset_mems_allowed, preferred_node);
+			    pol->w.cpuset_mems_allowed, *preferred_nodes);
 		pol->v.preferred_nodes = tmp;
 		pol->w.cpuset_mems_allowed = *nodes;
 	}
 }
 
+/* MPOL_PREFERRED_MANY allows multiple nodes to be set in 'nodes' */
+static void __maybe_unused mpol_rebind_preferred_many(struct mempolicy *pol,
+						      const nodemask_t *nodes)
+{
+	mpol_rebind_preferred_common(pol, nodes, nodes);
+}
+
+static void mpol_rebind_preferred(struct mempolicy *pol,
+				  const nodemask_t *nodes)
+{
+	nodemask_t preferred_node;
+
+	/* MPOL_PREFERRED uses only the first node in 'nodes' */
+	preferred_node = nodemask_of_node(first_node(*nodes));
+
+	mpol_rebind_preferred_common(pol, &preferred_node, nodes);
+}
+
 /*
  * mpol_rebind_policy - Migrate a policy to a different set of nodes
  *
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 09/18] mm: Finish handling MPOL_PREFERRED_MANY
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (7 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 08/18] mm/mempolicy: refactor rebind code for PREFERRED_MANY Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 10/18] mm: clean up alloc_pages_vma (thp) Ben Widawsky
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andrew Morton, Dan Williams, Dave Hansen, Mel Gorman

Now that there is a function to generate the preferred zonelist given a
preferred mask, bindmask, and flags it is possible to support
MPOL_PREFERRED_MANY policy easily in more places.

This patch was developed on top of Dave's original work. When Dave wrote
his patches there was no clean way to implement MPOL_PREFERRED_MANY. Now
that the other bits are in place, this is easy to drop on top.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 include/linux/mmzone.h |  3 +++
 mm/mempolicy.c         | 20 ++++++++++++++++++--
 mm/page_alloc.c        |  5 ++---
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c4c37fd12104..6b62ee98bb96 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1001,6 +1001,9 @@ struct zoneref *__next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx,
 					nodemask_t *nodes);
 
+struct zonelist *preferred_zonelist(gfp_t gfp_mask, const nodemask_t *prefmask,
+				    const nodemask_t *bindmask);
+
 /**
  * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
  * @z - The cursor used as a starting point for the search
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bfc4ef2af90d..90bc9c93b1b9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1995,7 +1995,6 @@ unsigned int mempolicy_slab_node(void)
 		return node;
 
 	switch (policy->mode) {
-	case MPOL_PREFERRED_MANY:
 	case MPOL_PREFERRED:
 		/*
 		 * handled MPOL_F_LOCAL above
@@ -2020,6 +2019,18 @@ unsigned int mempolicy_slab_node(void)
 		return z->zone ? zone_to_nid(z->zone) : node;
 	}
 
+	case MPOL_PREFERRED_MANY: {
+		struct zoneref *z;
+		struct zonelist *zonelist;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+
+		zonelist = preferred_zonelist(GFP_KERNEL,
+					      &policy->v.preferred_nodes, NULL);
+		z = first_zones_zonelist(zonelist, highest_zoneidx,
+					 &policy->v.nodes);
+		return z->zone ? zone_to_nid(z->zone) : node;
+	}
+
 	default:
 		BUG();
 	}
@@ -2585,7 +2596,12 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		polnid = zone_to_nid(z->zone);
 		break;
 
-	/* case MPOL_PREFERRED_MANY: */
+	case MPOL_PREFERRED_MANY:
+		z = first_zones_zonelist(preferred_zonelist(GFP_HIGHUSER,
+							    &pol->v.preferred_nodes, NULL),
+					 gfp_zone(GFP_HIGHUSER), &pol->v.preferred_nodes);
+		polnid = zone_to_nid(z->zone);
+		break;
 
 	default:
 		BUG();
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3cf44b6c31ae..c6f8f112a5d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4861,9 +4861,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
  * NB: That zonelist will have *all* zones in the fallback case, and not all of
  * those zones will belong to preferred nodes.
  */
-static struct zonelist *preferred_zonelist(gfp_t gfp_mask,
-					   const nodemask_t *prefmask,
-					   const nodemask_t *bindmask)
+struct zonelist *preferred_zonelist(gfp_t gfp_mask, const nodemask_t *prefmask,
+				    const nodemask_t *bindmask)
 {
 	nodemask_t pref;
 	int nid, local_node = numa_mem_id();
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 10/18] mm: clean up alloc_pages_vma (thp)
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (8 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 09/18] mm: Finish handling MPOL_PREFERRED_MANY Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 11/18] mm: Extract THP hugepage allocation Ben Widawsky
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Ben Widawsky, Andrew Morton, David Rientjes, Dave Hansen

__alloc_pages_nodemask() already does the right thing for a preferred
node and bind nodemask. Calling it directly allows us to simplify much
of this. The handling occurs in prepare_alloc_pages().

A VM assertion is added to prove correctness.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 90bc9c93b1b9..408ba78c8424 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2293,27 +2293,29 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 			hpage_node = first_node(pol->v.preferred_nodes);
 
 		nmask = policy_nodemask(gfp, pol);
-		if (!nmask || node_isset(hpage_node, *nmask)) {
-			mpol_cond_put(pol);
-			/*
-			 * First, try to allocate THP only on local node, but
-			 * don't reclaim unnecessarily, just compact.
-			 */
-			page = __alloc_pages_node(hpage_node,
-				gfp | __GFP_THISNODE | __GFP_NORETRY, order);
+		mpol_cond_put(pol);
 
-			/*
-			 * If hugepage allocations are configured to always
-			 * synchronous compact or the vma has been madvised
-			 * to prefer hugepage backing, retry allowing remote
-			 * memory with both reclaim and compact as well.
-			 */
-			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
-				page = __alloc_pages_node(hpage_node,
-								gfp, order);
+		/*
+		 * First, try to allocate THP only on local node, but
+		 * don't reclaim unnecessarily, just compact.
+		 */
+		page = __alloc_pages_nodemask(gfp | __GFP_THISNODE |
+						      __GFP_NORETRY,
+					      order, hpage_node, nmask);
 
-			goto out;
-		}
+		/*
+		 * If hugepage allocations are configured to always synchronous
+		 * compact or the vma has been madvised to prefer hugepage
+		 * backing, retry allowing remote memory with both reclaim and
+		 * compact as well.
+		 */
+		if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+			page = __alloc_pages_nodemask(gfp, order, hpage_node,
+						      nmask);
+
+		VM_BUG_ON(page && nmask &&
+			  !node_isset(page_to_nid(page), *nmask));
+		goto out;
 	}
 
 	nmask = policy_nodemask(gfp, pol);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 11/18] mm: Extract THP hugepage allocation
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (9 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 10/18] mm: clean up alloc_pages_vma (thp) Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 12/18] mm/mempolicy: Use __alloc_page_node for interleaved Ben Widawsky
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Ben Widawsky, Andrew Morton, Dave Hansen, Michal Hocko

The next patch is going to rework this code to support
MPOL_PREFERRED_MANY. This refactor makes that change much more
readable.

After the extraction, the resulting code makes it apparent that this can
be converted to a simple if ladder and thus allows removing the goto.

There are not meant to be any functional or behavioral changes.

Note that at this point MPOL_PREFERRED_MANY still isn't specially
handled for huge pages.
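
After the extraction, alloc_pages_vma() is roughly shaped like this
(a sketch of the resulting if ladder; mpol refcount handling elided):

	if (pol->mode == MPOL_INTERLEAVE) {
		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
		page = alloc_page_interleave(gfp, order, nid);
	} else if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
		page = alloc_pages_vma_thp(gfp, pol, order, node);
	} else {
		page = __alloc_pages_nodemask(gfp, order,
					      policy_node(gfp, pol, node),
					      policy_nodemask(gfp, pol));
	}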

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 96 ++++++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 408ba78c8424..3ce2354fed44 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2232,6 +2232,48 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 	return page;
 }
 
+static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
+					int order, int node)
+{
+	nodemask_t *nmask;
+	struct page *page;
+	int hpage_node = node;
+
+	/*
+	 * For hugepage allocation and non-interleave policy which allows the
+	 * current node (or other explicitly preferred node) we only try to
+	 * allocate from the current/preferred node and don't fall back to other
+	 * nodes, as the cost of remote accesses would likely offset THP
+	 * benefits.
+	 *
+	 * If the policy is interleave or multiple preferred nodes, or does not
+	 * allow the current node in its nodemask, we allocate the standard way.
+	 */
+	if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
+		hpage_node = first_node(pol->v.preferred_nodes);
+
+	nmask = policy_nodemask(gfp, pol);
+
+	/*
+	 * First, try to allocate THP only on local node, but don't reclaim
+	 * unnecessarily, just compact.
+	 */
+	page = __alloc_pages_nodemask(gfp | __GFP_THISNODE | __GFP_NORETRY,
+				      order, hpage_node, nmask);
+
+	/*
+	 * If hugepage allocations are configured to always synchronous compact
+	 * or the vma has been madvised to prefer hugepage backing, retry
+	 * allowing remote memory with both reclaim and compact as well.
+	 */
+	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
+		page = __alloc_pages_nodemask(gfp, order, hpage_node, nmask);
+
+	VM_BUG_ON(page && nmask && !node_isset(page_to_nid(page), *nmask));
+
+	return page;
+}
+
 /**
  * 	alloc_pages_vma	- Allocate a page for a VMA.
  *
@@ -2272,57 +2314,17 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
 		mpol_cond_put(pol);
 		page = alloc_page_interleave(gfp, order, nid);
-		goto out;
-	}
-
-	if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
-		int hpage_node = node;
-
-		/*
-		 * For hugepage allocation and non-interleave policy which
-		 * allows the current node (or other explicitly preferred
-		 * node) we only try to allocate from the current/preferred
-		 * node and don't fall back to other nodes, as the cost of
-		 * remote accesses would likely offset THP benefits.
-		 *
-		 * If the policy is interleave or multiple preferred nodes, or
-		 * does not allow the current node in its nodemask, we allocate
-		 * the standard way.
-		 */
-		if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
-			hpage_node = first_node(pol->v.preferred_nodes);
-
+	} else if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+			    hugepage)) {
+		page = alloc_pages_vma_thp(gfp, pol, order, node);
+		mpol_cond_put(pol);
+	} else {
 		nmask = policy_nodemask(gfp, pol);
+		preferred_nid = policy_node(gfp, pol, node);
+		page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
 		mpol_cond_put(pol);
-
-		/*
-		 * First, try to allocate THP only on local node, but
-		 * don't reclaim unnecessarily, just compact.
-		 */
-		page = __alloc_pages_nodemask(gfp | __GFP_THISNODE |
-						      __GFP_NORETRY,
-					      order, hpage_node, nmask);
-
-		/*
-		 * If hugepage allocations are configured to always synchronous
-		 * compact or the vma has been madvised to prefer hugepage
-		 * backing, retry allowing remote memory with both reclaim and
-		 * compact as well.
-		 */
-		if (!page && (gfp & __GFP_DIRECT_RECLAIM))
-			page = __alloc_pages_nodemask(gfp, order, hpage_node,
-						      nmask);
-
-		VM_BUG_ON(page && nmask &&
-			  !node_isset(page_to_nid(page), *nmask));
-		goto out;
 	}
 
-	nmask = policy_nodemask(gfp, pol);
-	preferred_nid = policy_node(gfp, pol, node);
-	page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
-	mpol_cond_put(pol);
-out:
 	return page;
 }
 EXPORT_SYMBOL(alloc_pages_vma);
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 12/18] mm/mempolicy: Use __alloc_page_node for interleaved
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (10 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 11/18] mm: Extract THP hugepage allocation Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 13/18] mm: kill __alloc_pages Ben Widawsky
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Ben Widawsky, Andrew Morton, Vlastimil Babka

This helps reduce the consumers of the interface and gets us in better
shape to clean up some of the low-level page allocation routines. The
goal in doing that is to eventually limit the places we'll need to
declare nodemask_t variables on the stack (more on that later).

Currently the only distinction between __alloc_pages_node and
__alloc_pages is that the former does sanity checks on the gfp flags and
the nid. In the case of interleave nodes, this isn't necessary because
the caller has already figured out the right nid and flags with
interleave_nodes().

This kills the only real user of __alloc_pages, which can then be
removed later.
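
For reference, the checks being skipped are the two assertions in
__alloc_pages_node():

	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));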

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3ce2354fed44..eb2520d68a04 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2220,7 +2220,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 {
 	struct page *page;
 
-	page = __alloc_pages(gfp, order, nid);
+	page = __alloc_pages_node(nid, gfp, order);
 	/* skip NUMA_INTERLEAVE_HIT counter update if numa stats is disabled */
 	if (!static_branch_likely(&vm_numa_stat_key))
 		return page;
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 13/18] mm: kill __alloc_pages
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (11 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 12/18] mm/mempolicy: Use __alloc_page_node for interleaved Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 14/18] mm/mempolicy: Introduce policy_preferred_nodes() Ben Widawsky
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Ben Widawsky, Andrew Morton, Michal Hocko

IMPORTANT NOTE: It's unclear how safe it is to declare nodemask_t on the
stack, when nodemask_t can be relatively large in huge NUMA systems.
Upcoming patches will try to limit this.

The primary purpose of this patch is to clear up which interfaces should
be used for page allocation.

There are several attributes in page allocation after the obvious gfp
and order:
1. node mask: set of nodes to try to allocate from, fail if unavailable
2. preferred nid: a preferred node to try to allocate from, falling back
to node mask if unavailable
3. (soon) preferred mask: like preferred nid, but multiple nodes.

Here's a summary of the existing interfaces, and which they cover
*alloc_pages: 		()
*alloc_pages_node:	(2)
__alloc_pages_nodemask: (1,2,3)

I am instead proposing the following interfaces as a reasonable
set. Generally, node binding isn't used by kernel code; it's only used
for mempolicy. On the other hand, the kernel does have preferred nodes
(today it's only one), and that is why those interfaces exist while an
interface to specify binding does not.

alloc_pages: () I don't care, give me pages.
alloc_pages_node: (2) I want pages from this particular node first
alloc_pages_nodes: (3) I want pages from *these* nodes first
__alloc_pages_nodemask: (1,2,3) I'm picky about my pages
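
As a rough sketch of the intended call sites (illustrative only; nid,
prefmask and bindmask are placeholder variables, and the numbers refer
to the attributes listed above):

	page = alloc_pages(GFP_KERNEL, 0);			/* ()      */
	page = alloc_pages_node(nid, GFP_KERNEL, 0);		/* (2)     */
	page = __alloc_pages_nodes(&prefmask, GFP_KERNEL, 0);	/* (3)     */
	page = __alloc_pages_nodemask(GFP_KERNEL, 0, nid, &bindmask); /* (1,2,3) */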

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 include/linux/gfp.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..9ab5c07579bd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -504,9 +504,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 							nodemask_t *nodemask);
 
 static inline struct page *
-__alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
+__alloc_pages_nodes(nodemask_t *nodes, gfp_t gfp_mask, unsigned int order)
 {
-	return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, first_node(*nodes),
+				      NULL);
 }
 
 /*
@@ -516,10 +517,12 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
 static inline struct page *
 __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
 {
+	nodemask_t tmp;
 	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
 	VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));
 
-	return __alloc_pages(gfp_mask, order, nid);
+	tmp = nodemask_of_node(nid);
+	return __alloc_pages_nodes(&tmp, gfp_mask, order);
 }
 
 /*
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 14/18] mm/mempolicy: Introduce policy_preferred_nodes()
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (12 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 13/18] mm: kill __alloc_pages Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 15/18] mm: convert callers of __alloc_pages_nodemask to pmask Ben Widawsky
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andrew Morton, Dave Hansen, Li Xinhai,
	Michal Hocko, Vlastimil Babka

Current code provides a policy_node() helper which, given a fallback
node, gfp flags, and a policy, determines the preferred node. Going
forward it is desirable to have this same functionality for a set of
nodes rather than a single node. policy_node() is then implemented in
terms of the now more generic policy_preferred_nodes().

I went back and forth as to whether this function should take in a set
of preferred nodes and modify that. Something like:
policy_preferred_nodes(gfp, *policy, *mask);

That idea was nice as it allowed the policy function to create the mask
to be used. Ultimately, it turns out callers don't need such fanciness,
and those callers would use this mask directly in page allocation
functions that can accept NULL for a preference mask. So having this
function return NULL when there is no ideal mask turns out to be
beneficial.
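
A rough sketch of the eventual caller pattern (assuming the later
patches in the series that teach __alloc_pages_nodemask() to accept a
preference mask):

	nodemask_t *pmask = policy_preferred_nodes(gfp, pol);

	/* NULL here simply means "no particular preference" */
	page = __alloc_pages_nodemask(gfp, order, pmask,
				      policy_nodemask(gfp, pol));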

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Li Xinhai <lixinhai.lxh@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 57 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eb2520d68a04..3c48f299d344 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1946,24 +1946,61 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
 	return NULL;
 }
 
-/* Return the node id preferred by the given mempolicy, or the given id */
-static int policy_node(gfp_t gfp, struct mempolicy *policy,
-								int nd)
+/*
+ * Returns a nodemask to be used for preference if the given policy dictates.
+ * Otherwise, returns NULL and the caller should likely use
+ * nodemask_of_node(numa_mem_id());
+ */
+static nodemask_t *policy_preferred_nodes(gfp_t gfp, struct mempolicy *policy)
 {
-	if ((policy->mode == MPOL_PREFERRED ||
-	     policy->mode == MPOL_PREFERRED_MANY) &&
-	    !(policy->flags & MPOL_F_LOCAL)) {
-		nd = first_node(policy->v.preferred_nodes);
-	} else {
+	nodemask_t *pol_pref = &policy->v.preferred_nodes;
+
+	/*
+	 * There are 2 "levels" of policy. What the callers asked for
+	 * (prefmask), and what the memory policy should be for the given gfp.
+	 * The memory policy takes preference in the case that prefmask isn't a
+	 * subset of the mem policy.
+	 */
+	switch (policy->mode) {
+	case MPOL_PREFERRED:
+		/* local, or buggy policy */
+		if (policy->flags & MPOL_F_LOCAL ||
+		    WARN_ON(nodes_weight(*pol_pref) != 1))
+			return NULL;
+		else
+			return pol_pref;
+		break;
+	case MPOL_PREFERRED_MANY:
+		if (WARN_ON(nodes_weight(*pol_pref) == 0))
+			return NULL;
+		else
+			return pol_pref;
+		break;
+	default:
+	case MPOL_INTERLEAVE:
+	case MPOL_BIND:
 		/*
 		 * __GFP_THISNODE shouldn't even be used with the bind policy
 		 * because we might easily break the expectation to stay on the
 		 * requested node and not break the policy.
 		 */
-		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
+		WARN_ON_ONCE(gfp & __GFP_THISNODE);
+		break;
 	}
 
-	return nd;
+	return NULL;
+}
+
+/* Return the node id preferred by the given mempolicy, or the given id */
+static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
+{
+	nodemask_t *tmp;
+
+	tmp = policy_preferred_nodes(gfp, policy);
+	if (tmp)
+		return first_node(*tmp);
+	else
+		return nd;
 }
 
 /* Do dynamic interleaving for a process */
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 15/18] mm: convert callers of __alloc_pages_nodemask to pmask
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (13 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 14/18] mm/mempolicy: Introduce policy_preferred_nodes() Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 16/18] alloc_pages_nodemask: turn preferred nid into a nodemask Ben Widawsky
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andrew Morton, Dave Hansen, Mike Kravetz,
	Mina Almasry, Vlastimil Babka

Now that the infrastructure is in place to both select and allocate a
set of preferred nodes as specified by policy (or perhaps in the future,
the calling function), start transitioning over functions that can
benefit from this.

This patch looks stupid. It seems to artificially insert a nodemask on
the stack, then just use the first node from that mask - in other words,
a nop that just adds overhead. It does. The reason is that this is a
preparatory patch for switching __alloc_pages_nodemask() over to taking
a mask for preferences. This helps with readability and
bisectability.
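
Concretely, the intermediate pattern looks something like this (sketch):

	nodemask_t pmask = nodemask_of_node(nid);

	/* Looks like a nop: build a mask, then take its first node again */
	page = __alloc_pages_nodemask(gfp_mask, order, first_node(pmask), nmask);

A later patch then flips the call to pass &pmask directly.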

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/hugetlb.c   | 11 ++++++++---
 mm/mempolicy.c | 38 +++++++++++++++++++++++---------------
 2 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 57ece74e3aae..71b6750661df 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1687,6 +1687,12 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	int order = huge_page_order(h);
 	struct page *page;
 	bool alloc_try_hard = true;
+	nodemask_t pmask;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_mem_id();
+
+	pmask = nodemask_of_node(nid);
 
 	/*
 	 * By default we always try hard to allocate the page with
@@ -1700,9 +1706,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
 	if (alloc_try_hard)
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
-	if (nid == NUMA_NO_NODE)
-		nid = numa_mem_id();
-	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
+	page = __alloc_pages_nodemask(gfp_mask, order, first_node(pmask),
+				      nmask);
 	if (page)
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3c48f299d344..9521bb46aa00 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2270,11 +2270,11 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 }
 
 static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
-					int order, int node)
+					int order, nodemask_t *prefmask)
 {
 	nodemask_t *nmask;
 	struct page *page;
-	int hpage_node = node;
+	int hpage_node = first_node(*prefmask);
 
 	/*
 	 * For hugepage allocation and non-interleave policy which allows the
@@ -2286,9 +2286,6 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 	 * If the policy is interleave or multiple preferred nodes, or does not
 	 * allow the current node in its nodemask, we allocate the standard way.
 	 */
-	if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
-		hpage_node = first_node(pol->v.preferred_nodes);
-
 	nmask = policy_nodemask(gfp, pol);
 
 	/*
@@ -2340,10 +2337,14 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 {
 	struct mempolicy *pol;
 	struct page *page;
-	int preferred_nid;
-	nodemask_t *nmask;
+	nodemask_t *nmask, *pmask, tmp;
 
 	pol = get_vma_policy(vma, addr);
+	pmask = policy_preferred_nodes(gfp, pol);
+	if (!pmask) {
+		tmp = nodemask_of_node(node);
+		pmask = &tmp;
+	}
 
 	if (pol->mode == MPOL_INTERLEAVE) {
 		unsigned nid;
@@ -2353,12 +2354,12 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		page = alloc_page_interleave(gfp, order, nid);
 	} else if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
 			    hugepage)) {
-		page = alloc_pages_vma_thp(gfp, pol, order, node);
+		page = alloc_pages_vma_thp(gfp, pol, order, pmask);
 		mpol_cond_put(pol);
 	} else {
 		nmask = policy_nodemask(gfp, pol);
-		preferred_nid = policy_node(gfp, pol, node);
-		page = __alloc_pages_nodemask(gfp, order, preferred_nid, nmask);
+		page = __alloc_pages_nodemask(gfp, order, first_node(*pmask),
+					      nmask);
 		mpol_cond_put(pol);
 	}
 
@@ -2393,12 +2394,19 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	 * No reference counting needed for current->mempolicy
 	 * nor system default_policy
 	 */
-	if (pol->mode == MPOL_INTERLEAVE)
+	if (pol->mode == MPOL_INTERLEAVE) {
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
-	else
-		page = __alloc_pages_nodemask(gfp, order,
-				policy_node(gfp, pol, numa_node_id()),
-				policy_nodemask(gfp, pol));
+	} else {
+		nodemask_t tmp, *pmask;
+
+		pmask = policy_preferred_nodes(gfp, pol);
+		if (!pmask) {
+			tmp = nodemask_of_node(numa_node_id());
+			pmask = &tmp;
+		}
+		page = __alloc_pages_nodemask(gfp, order, first_node(*pmask),
+					      policy_nodemask(gfp, pol));
+	}
 
 	return page;
 }
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 16/18] alloc_pages_nodemask: turn preferred nid into a nodemask
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (14 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 15/18] mm: convert callers of __alloc_pages_nodemask to pmask Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 17/18] mm: Use less stack for page allocations Ben Widawsky
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andrew Morton, Dave Hansen, Jason Gunthorpe,
	Michal Hocko, Mike Kravetz

The guts of the page allocator already understand that the memory
policy might provide multiple preferred nodes. Ideally, the alloc
function itself wouldn't take multiple nodes until one of the callers
decided it would be useful. Unfortunately, as the callstack stands
today, the caller of __alloc_pages_nodemask is responsible for figuring
out the preferred nodes (almost always there is no policy in place, in
which case this is numa_node_id()). The purpose of this patch is to
allow multiple preferred nodes while keeping the existing logical
preference assignments in place.

In other words, everything at and below __alloc_pages_nodemask() has no
concept of policy, and this patch maintains that division.

Like bindmask, NULL and an empty set for preference are allowed.

A note on allocation: one of the obvious fallouts from this is that
some callers are now going to allocate nodemasks on their stack. When
no policy is in place, these nodemasks are simply
nodemask_of_node(numa_node_id()). Some amount of this is addressed in
the next patch. The alternatives are kmalloc, which is unsafe in these
paths; a percpu variable, which can't work because a nodemask can be
128B at the max NODE_SHIFT of 10 on x86 and ia64, too large for a
percpu variable; or a lookup table. There's no reason a lookup table
can't work, but it seems like a premature optimization. If you were to
make a lookup table for the more extreme cases of large systems, each
nodemask would be 128B, and you have 1024 nodes - so the size of just
that is 128K.
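
For reference, the sizing works out as follows (assuming the maximum
NODE_SHIFT of 10 mentioned above):

	MAX_NUMNODES       = 1 << 10            = 1024 nodes
	sizeof(nodemask_t) = 1024 bits / 8      = 128 bytes
	lookup table       = 1024 nodes * 128B  = 128K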

I'm very open to better solutions.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 include/linux/gfp.h     |  8 +++-----
 include/linux/migrate.h |  4 ++--
 mm/hugetlb.c            |  3 +--
 mm/mempolicy.c          | 27 ++++++---------------------
 mm/page_alloc.c         | 10 ++++------
 5 files changed, 16 insertions(+), 36 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 9ab5c07579bd..47e9c02c17ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -499,15 +499,13 @@ static inline int arch_make_page_accessible(struct page *page)
 }
 #endif
 
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-							nodemask_t *nodemask);
+struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+				    nodemask_t *prefmask, nodemask_t *nodemask);
 
 static inline struct page *
 __alloc_pages_nodes(nodemask_t *nodes, gfp_t gfp_mask, unsigned int order)
 {
-	return __alloc_pages_nodemask(gfp_mask, order, first_node(*nodes),
-				      NULL);
+	return __alloc_pages_nodemask(gfp_mask, order, nodes, NULL);
 }
 
 /*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3e546cbf03dd..91b399ec9249 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -37,6 +37,7 @@ static inline struct page *new_page_nodemask(struct page *page,
 	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
 	unsigned int order = 0;
 	struct page *new_page = NULL;
+	nodemask_t pmask = nodemask_of_node(preferred_nid);
 
 	if (PageHuge(page))
 		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
@@ -50,8 +51,7 @@ static inline struct page *new_page_nodemask(struct page *page,
 	if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE))
 		gfp_mask |= __GFP_HIGHMEM;
 
-	new_page = __alloc_pages_nodemask(gfp_mask, order,
-				preferred_nid, nodemask);
+	new_page = __alloc_pages_nodemask(gfp_mask, order, &pmask, nodemask);
 
 	if (new_page && PageTransHuge(new_page))
 		prep_transhuge_page(new_page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 71b6750661df..52e097aed7ed 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1706,8 +1706,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
 	if (alloc_try_hard)
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
-	page = __alloc_pages_nodemask(gfp_mask, order, first_node(pmask),
-				      nmask);
+	page = __alloc_pages_nodemask(gfp_mask, order, &pmask, nmask);
 	if (page)
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9521bb46aa00..fb49bea41ab8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2274,7 +2274,6 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 {
 	nodemask_t *nmask;
 	struct page *page;
-	int hpage_node = first_node(*prefmask);
 
 	/*
 	 * For hugepage allocation and non-interleave policy which allows the
@@ -2282,9 +2281,6 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 	 * allocate from the current/preferred node and don't fall back to other
 	 * nodes, as the cost of remote accesses would likely offset THP
 	 * benefits.
-	 *
-	 * If the policy is interleave or multiple preferred nodes, or does not
-	 * allow the current node in its nodemask, we allocate the standard way.
 	 */
 	nmask = policy_nodemask(gfp, pol);
 
@@ -2293,7 +2289,7 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 	 * unnecessarily, just compact.
 	 */
 	page = __alloc_pages_nodemask(gfp | __GFP_THISNODE | __GFP_NORETRY,
-				      order, hpage_node, nmask);
+				      order, prefmask, nmask);
 
 	/*
 	 * If hugepage allocations are configured to always synchronous compact
@@ -2301,7 +2297,7 @@ static struct page *alloc_pages_vma_thp(gfp_t gfp, struct mempolicy *pol,
 	 * allowing remote memory with both reclaim and compact as well.
 	 */
 	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
-		page = __alloc_pages_nodemask(gfp, order, hpage_node, nmask);
+		page = __alloc_pages_nodemask(gfp, order, prefmask, nmask);
 
 	VM_BUG_ON(page && nmask && !node_isset(page_to_nid(page), *nmask));
 
@@ -2337,14 +2333,10 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 {
 	struct mempolicy *pol;
 	struct page *page;
-	nodemask_t *nmask, *pmask, tmp;
+	nodemask_t *nmask, *pmask;
 
 	pol = get_vma_policy(vma, addr);
 	pmask = policy_preferred_nodes(gfp, pol);
-	if (!pmask) {
-		tmp = nodemask_of_node(node);
-		pmask = &tmp;
-	}
 
 	if (pol->mode == MPOL_INTERLEAVE) {
 		unsigned nid;
@@ -2358,9 +2350,8 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		mpol_cond_put(pol);
 	} else {
 		nmask = policy_nodemask(gfp, pol);
-		page = __alloc_pages_nodemask(gfp, order, first_node(*pmask),
-					      nmask);
 		mpol_cond_put(pol);
+		page = __alloc_pages_nodemask(gfp, order, pmask, nmask);
 	}
 
 	return page;
@@ -2397,14 +2388,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 	if (pol->mode == MPOL_INTERLEAVE) {
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	} else {
-		nodemask_t tmp, *pmask;
-
-		pmask = policy_preferred_nodes(gfp, pol);
-		if (!pmask) {
-			tmp = nodemask_of_node(numa_node_id());
-			pmask = &tmp;
-		}
-		page = __alloc_pages_nodemask(gfp, order, first_node(*pmask),
+		page = __alloc_pages_nodemask(gfp, order,
+					      policy_preferred_nodes(gfp, pol),
 					      policy_nodemask(gfp, pol));
 	}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c6f8f112a5d4..0f90419fe0d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4967,15 +4967,13 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-							nodemask_t *nodemask)
+struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+				    nodemask_t *prefmask, nodemask_t *nodemask)
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
-	nodemask_t prefmask = nodemask_of_node(preferred_nid);
 
 	/*
 	 * There are several places where we assume that the order value is sane
@@ -4988,11 +4986,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 	gfp_mask &= gfp_allowed_mask;
 	alloc_mask = gfp_mask;
-	if (!prepare_alloc_pages(gfp_mask, order, &prefmask, nodemask, &ac,
+	if (!prepare_alloc_pages(gfp_mask, order, prefmask, nodemask, &ac,
 				 &alloc_mask, &alloc_flags))
 		return NULL;
 
-	ac.prefmask = &prefmask;
+	ac.prefmask = prefmask;
 
 	finalise_ac(gfp_mask, &ac);
 
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 17/18] mm: Use less stack for page allocations
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (15 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 16/18] alloc_pages_nodemask: turn preferred nid into a nodemask Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-19 16:24 ` [PATCH 18/18] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY Ben Widawsky
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Ben Widawsky, Andrew Morton, Michal Hocko, Tejun Heo

After converting __alloc_pages_nodemask to take in a preferred
nodemask, __alloc_pages_node is left holding the bag as requiring stack
space, since it needs to generate a nodemask for the specific node.
The patch attempts to remove all callers of it unless absolutely
necessary, to avoid using stack space, which is theoretically
significant on huge NUMA systems.

It turns out there aren't too many opportunities to do this as all
callers know exactly what they want. The difference between
__alloc_pages_node and alloc_pages_node is that the former is meant for
explicit node allocation while the latter supports providing no
preference (by specifying NUMA_NO_NODE as nid). Now it becomes clear
that NUMA_NO_NODE can be implemented without using stack space via some
of the newer functions that have been added, in particular,
__alloc_pages_nodes and __alloc_pages_nodemask.

In the non-NUMA case, alloc_pages() used numa_node_id(), which is 0.
Switching to NUMA_NO_NODE allows us to avoid using the stack.
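
For example (sketch):

	/* NUMA_NO_NODE now routes through __alloc_pages_nodes(NULL, ...),
	 * so no nodemask has to be built on the stack.
	 */
	page = alloc_pages_node(NUMA_NO_NODE, GFP_KERNEL, 0);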

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 include/linux/gfp.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 47e9c02c17ae..e78982ef9349 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -532,7 +532,7 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
 	if (nid == NUMA_NO_NODE)
-		nid = numa_mem_id();
+		return __alloc_pages_nodes(NULL, gfp_mask, order);
 
 	return __alloc_pages_node(nid, gfp_mask, order);
 }
@@ -551,8 +551,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
 #define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
 	alloc_pages_vma(gfp_mask, order, vma, addr, numa_node_id(), true)
 #else
-#define alloc_pages(gfp_mask, order) \
-		alloc_pages_node(numa_node_id(), gfp_mask, order)
+#define alloc_pages(gfp_mask, order)                                           \
+	alloc_pages_node(NUMA_NO_NODE, gfp_mask, order)
 #define alloc_pages_vma(gfp_mask, order, vma, addr, node, false)\
 	alloc_pages(gfp_mask, order)
 #define alloc_hugepage_vma(gfp_mask, vma, addr, order) \
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 18/18] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (16 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 17/18] mm: Use less stack for page allocations Ben Widawsky
@ 2020-06-19 16:24 ` Ben Widawsky
  2020-06-22  7:09 ` [PATCH 00/18] multiple preferred nodes Michal Hocko
  2020-06-22 20:54 ` Andi Kleen
  19 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Ben Widawsky, Andrew Morton, Dave Hansen, David Hildenbrand,
	Jonathan Corbet, Michal Hocko, Vlastimil Babka

See comments in the code and previous commit messages for details of
the implementation and usage.

Fix whitespace while here.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 .../admin-guide/mm/numa_memory_policy.rst        | 16 ++++++++++++----
 include/uapi/linux/mempolicy.h                   |  6 +++---
 mm/mempolicy.c                                   | 14 ++++++--------
 mm/page_alloc.c                                  |  3 ---
 4 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 1ad020c459b8..b69963a37fc8 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -245,6 +245,14 @@ MPOL_INTERLEAVED
 	address range or file.  During system boot up, the temporary
 	interleaved system default policy works in this mode.
 
+MPOL_PREFERRED_MANY
+        This mode specifies that the allocation should be attempted from the
+        nodemask specified in the policy. If that allocation fails, the kernel
+        will search other nodes, in order of increasing distance from the first
+        set bit in the nodemask based on information provided by the platform
+        firmware. It is similar to MPOL_PREFERRED with the main exception that
+        it is an error to have an empty nodemask.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
@@ -253,10 +261,10 @@ MPOL_F_STATIC_NODES
 	nodes changes after the memory policy has been defined.
 
 	Without this flag, any time a mempolicy is rebound because of a
-	change in the set of allowed nodes, the node (Preferred) or
-	nodemask (Bind, Interleave) is remapped to the new set of
-	allowed nodes.  This may result in nodes being used that were
-	previously undesired.
+        change in the set of allowed nodes, the preferred nodemask (Preferred
+        Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
+        remapped to the new set of allowed nodes.  This may result in nodes
+        being used that were previously undesired.
 
 	With this flag, if the user-specified nodes overlap with the
 	nodes allowed by the task's cpuset, then the memory policy is
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3354774af61e..ad3eee651d4e 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -16,13 +16,13 @@
  */
 
 /* Policies */
-enum {
-	MPOL_DEFAULT,
+enum { MPOL_DEFAULT,
 	MPOL_PREFERRED,
 	MPOL_BIND,
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
-	MPOL_MAX,	/* always last member of enum */
+	MPOL_PREFERRED_MANY,
+	MPOL_MAX, /* always last member of enum */
 };
 
 /* Flags for set_mempolicy */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fb49bea41ab8..07e916f8f6b7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -108,8 +108,6 @@
 
 #include "internal.h"
 
-#define MPOL_PREFERRED_MANY MPOL_MAX
-
 /* Internal flags */
 #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
 #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
@@ -180,7 +178,7 @@ struct mempolicy *get_task_policy(struct task_struct *p)
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	void (*rebind)(struct mempolicy *pol, const nodemask_t *nodes);
-} mpol_ops[MPOL_MAX + 1];
+} mpol_ops[MPOL_MAX];
 
 static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
 {
@@ -385,8 +383,8 @@ static void mpol_rebind_preferred_common(struct mempolicy *pol,
 }
 
 /* MPOL_PREFERRED_MANY allows multiple nodes to be set in 'nodes' */
-static void __maybe_unused mpol_rebind_preferred_many(struct mempolicy *pol,
-						      const nodemask_t *nodes)
+static void mpol_rebind_preferred_many(struct mempolicy *pol,
+				       const nodemask_t *nodes)
 {
 	mpol_rebind_preferred_common(pol, nodes, nodes);
 }
@@ -448,7 +446,7 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 	mmap_write_unlock(mm);
 }
 
-static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
+static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 	[MPOL_DEFAULT] = {
 		.rebind = mpol_rebind_default,
 	},
@@ -466,8 +464,8 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX + 1] = {
 	},
 	/* MPOL_LOCAL is converted to MPOL_PREFERRED on policy creation */
 	[MPOL_PREFERRED_MANY] = {
-		.create = NULL,
-		.rebind = NULL,
+		.create = mpol_new_preferred_many,
+		.rebind = mpol_rebind_preferred_many,
 	},
 };
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0f90419fe0d8..b89c9c2637bf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4867,9 +4867,6 @@ struct zonelist *preferred_zonelist(gfp_t gfp_mask, const nodemask_t *prefmask,
 	nodemask_t pref;
 	int nid, local_node = numa_mem_id();
 
-	/* Multi nodes not supported yet */
-	VM_BUG_ON(prefmask && nodes_weight(*prefmask) != 1);
-
 #define _isset(mask, node)                                                     \
 	(!(mask) || nodes_empty(*(mask)) ? 1 : node_isset(node, *(mask)))
 	/*
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (17 preceding siblings ...)
  2020-06-19 16:24 ` [PATCH 18/18] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY Ben Widawsky
@ 2020-06-22  7:09 ` Michal Hocko
  2020-06-23 11:20   ` Michal Hocko
  2020-06-22 20:54 ` Andi Kleen
  19 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2020-06-22  7:09 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

User visible APIs changes/additions should be posted to the linux-api
mailing list. Now added.

On Fri 19-06-20 09:24:07, Ben Widawsky wrote:
> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2)
> interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a
> preference for nodes which will fulfil memory allocation requests. Like the
> MPOL_BIND interface, it works over a set of nodes.
> 
> Summary:
> 1-2: Random fixes I found along the way
> 3-4: Logic to handle many preferred nodes in page allocation
> 5-9: Plumbing to allow multiple preferred nodes in mempolicy
> 10-13: Teach page allocation APIs about nodemasks
> 14: Provide a helper to generate preferred nodemasks
> 15: Have page allocation callers generate preferred nodemasks
> 16-17: Flip the switch to have __alloc_pages_nodemask take preferred mask.
> 18: Expose the new uapi
> 
> Along with these patches are patches for libnuma, numactl, numademo, and memhog.
> They still need some polish, but can be found here:
> https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
> It allows new usage: `numactl -P 0,3,4`
> 
> The goal of the new mode is to enable some use-cases when using tiered memory
> usage models which I've lovingly named.
> 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> requirements allowing preference to be given to all nodes with "fast" memory.
> 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> perhaps slow memory), but doesn't care which node it runs on. The application
> can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> etc). This reverses the nodes are chosen today where the kernel attempts to use
> local memory to the CPU whenever possible. This will attempt to use the local
> accelerator to the memory.
> 2. The Tortoise - The administrator (or the application itself) is aware it only
> needs slow memory, and so can prefer that.
> 
> Much of this is almost achievable with the bind interface, but the bind
> interface suffers from an inability to fallback to another set of nodes if
> binding fails to all nodes in the nodemask.
> 
> Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
> preference.
> 
> > /* Set first two nodes as preferred in an 8 node system. */
> > const unsigned long nodes = 0x3
> > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> 
> > /* Mimic interleave policy, but have fallback *.
> > const unsigned long nodes = 0xaa
> > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> 
> Some internal discussion took place around the interface. There are two
> alternatives which we have discussed, plus one I stuck in:
> 1. Ordered list of nodes. Currently it's believed that the added complexity is
>    nod needed for expected usecases.
> 2. A flag for bind to allow falling back to other nodes. This confuses the
>    notion of binding and is less flexible than the current solution.
> 3. Create flags or new modes that helps with some ordering. This offers both a
>    friendlier API as well as a solution for more customized usage. It's unknown
>    if it's worth the complexity to support this. Here is sample code for how
>    this might work:
> 
> > // Default
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > // which is the same as
> > set_mempolicy(MPOL_DEFAULT, NULL, 0);
> >
> > // The Hare
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> >
> > // The Tortoise
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> >
> > // Prefer the fast memory of the first two sockets
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> >
> > // Prefer specific nodes for some something wacky
> > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);
> 
> ---
> 
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Cc: Li Xinhai <lixinhai.lxh@gmail.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Mina Almasry <almasrymina@google.com>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> 
> Ben Widawsky (14):
>   mm/mempolicy: Add comment for missing LOCAL
>   mm/mempolicy: Use node_mem_id() instead of node_id()
>   mm/page_alloc: start plumbing multi preferred node
>   mm/page_alloc: add preferred pass to page allocation
>   mm: Finish handling MPOL_PREFERRED_MANY
>   mm: clean up alloc_pages_vma (thp)
>   mm: Extract THP hugepage allocation
>   mm/mempolicy: Use __alloc_page_node for interleaved
>   mm: kill __alloc_pages
>   mm/mempolicy: Introduce policy_preferred_nodes()
>   mm: convert callers of __alloc_pages_nodemask to pmask
>   alloc_pages_nodemask: turn preferred nid into a nodemask
>   mm: Use less stack for page allocations
>   mm/mempolicy: Advertise new MPOL_PREFERRED_MANY
> 
> Dave Hansen (4):
>   mm/mempolicy: convert single preferred_node to full nodemask
>   mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes
>   mm/mempolicy: allow preferred code to take a nodemask
>   mm/mempolicy: refactor rebind code for PREFERRED_MANY
> 
>  .../admin-guide/mm/numa_memory_policy.rst     |  22 +-
>  include/linux/gfp.h                           |  19 +-
>  include/linux/mempolicy.h                     |   4 +-
>  include/linux/migrate.h                       |   4 +-
>  include/linux/mmzone.h                        |   3 +
>  include/uapi/linux/mempolicy.h                |   6 +-
>  mm/hugetlb.c                                  |  10 +-
>  mm/internal.h                                 |   1 +
>  mm/mempolicy.c                                | 271 +++++++++++++-----
>  mm/page_alloc.c                               | 179 +++++++++++-
>  10 files changed, 403 insertions(+), 116 deletions(-)
> 
> 
> -- 
> 2.27.0

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
                   ` (18 preceding siblings ...)
  2020-06-22  7:09 ` [PATCH 00/18] multiple preferred nodes Michal Hocko
@ 2020-06-22 20:54 ` Andi Kleen
  2020-06-22 21:02   ` Ben Widawsky
  2020-06-22 21:07   ` Dave Hansen
  19 siblings, 2 replies; 44+ messages in thread
From: Andi Kleen @ 2020-06-22 20:54 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andrew Morton, Christoph Lameter, Dan Williams,
	Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
	Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
	Lee Schermerhorn, Li Xinhai, Mel Gorman, Michal Hocko,
	Mike Kravetz, Mina Almasry, Tejun Heo, Vlastimil Babka

On Fri, Jun 19, 2020 at 09:24:07AM -0700, Ben Widawsky wrote:
> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.

So the reason for having a new policy is that you're worried some legacy
application passes multiple nodes to MPOL_PREFERRED, where all but the
first would be currently ignored. Is that right?

Is there any indication that this is actually the case?

If not I would prefer to just extend the semantics of the existing MPOL_PREFERRED.

Even if there was such an legacy application any legacy behavior changes
are likely not fatal, because preferred is only a hint anyways. Anybody
who really requires the right nodes would use _BIND.

-Andi


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-22 20:54 ` Andi Kleen
@ 2020-06-22 21:02   ` Ben Widawsky
  2020-06-22 21:07   ` Dave Hansen
  1 sibling, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-22 21:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-mm, Andrew Morton, Christoph Lameter, Dan Williams,
	Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
	Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
	Lee Schermerhorn, Li Xinhai, Mel Gorman, Michal Hocko,
	Mike Kravetz, Mina Almasry, Tejun Heo, Vlastimil Babka

On 20-06-22 13:54:30, Andi Kleen wrote:
> On Fri, Jun 19, 2020 at 09:24:07AM -0700, Ben Widawsky wrote:
> > This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> 
> So the reason for having a new policy is that you're worried some legacy
> application passes multiple nodes to MPOL_PREFERRED, where all but the
> first would be currently ignored. Is that right?
> 
> Is there any indication that this is actually the case?
> 
> If not I would prefer to just extend the semantics of the existing MPOL_PREFERRED.
> 
> Even if there was such an legacy application any legacy behavior changes
> are likely not fatal, because preferred is only a hint anyways. Anybody
> who really requires the right nodes would use _BIND.
> 
> -Andi
> 

It does break ABI in a sense as the existing MPOL_PREFERRED does in fact specify
it should fail if more than one node is specified. A decent compromise might be
to add a new flag to be used with MPOL_PREFERRED, hinting that you want
something like a v2 version of the interface.

Honestly, I don't care either way. I inherited it from Dave Hansen, and I can
only guess at his motivation.

Perhaps more valuable to consider was #3 alternative in the snipped part of the
cover letter:

> 3. Create flags or new modes that helps with some ordering. This offers both a
>    friendlier API as well as a solution for more customized usage. It's unknown
>    if it's worth the complexity to support this. Here is sample code for how
>    this might work:
> 
> // Default
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> // which is the same as
> set_mempolicy(MPOL_DEFAULT, NULL, 0);
>
> // The Hare
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
>
> // The Tortoise
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
>
> // Prefer the fast memory of the first two sockets
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
>
> // Prefer specific nodes for some something wacky
> set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-22 20:54 ` Andi Kleen
  2020-06-22 21:02   ` Ben Widawsky
@ 2020-06-22 21:07   ` Dave Hansen
  2020-06-22 22:02     ` Andi Kleen
  1 sibling, 1 reply; 44+ messages in thread
From: Dave Hansen @ 2020-06-22 21:07 UTC (permalink / raw)
  To: Andi Kleen, Ben Widawsky
  Cc: linux-mm, Andrew Morton, Christoph Lameter, Dan Williams,
	Dave Hansen, David Hildenbrand, David Rientjes, Jason Gunthorpe,
	Johannes Weiner, Jonathan Corbet, Kuppuswamy Sathyanarayanan,
	Lee Schermerhorn, Li Xinhai, Mel Gorman, Michal Hocko,
	Mike Kravetz, Mina Almasry, Tejun Heo, Vlastimil Babka

On 6/22/20 1:54 PM, Andi Kleen wrote:
> On Fri, Jun 19, 2020 at 09:24:07AM -0700, Ben Widawsky wrote:
>> This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy.
> So the reason for having a new policy is that you're worried some legacy
> application passes multiple nodes to MPOL_PREFERRED, where all but the
> first would be currently ignored. Is that right?

It's one thing if our internal implementation threw away all but the
first preferred node silently.  We do that, but we also go pretty far to
blab in the manpage to solidify this behavior:

	If nodemask  specifies  more  than  one  node ID, the first node
	in the mask will be selected as the preferred node.

I think it's dangerous to go and actively break that promise.  It means
that newer apps will silently not get the behavior they ask for if they
run on an old kernel.

We also promise:

	The kernel will try to allocate pages from this node first and
	fall back to "near by" nodes if the preferred node is low on
	free memory.

A user could legitimately say that the kernel breaks this promise if
they passed multiple preferred nodes, one of them had lots of free
memory, and that node was not preferred.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-22 21:07   ` Dave Hansen
@ 2020-06-22 22:02     ` Andi Kleen
  0 siblings, 0 replies; 44+ messages in thread
From: Andi Kleen @ 2020-06-22 22:02 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ben Widawsky, linux-mm, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Michal Hocko, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka

> I think it's dangerous to go and actively break that promise.  It means
> that newer apps will silently not get the behavior they ask for if they
> run on an old kernel.

That can happen any time anyways because they didn't use BIND,
so they better deal with it.

Don't see any point in making the API uglier just for this. Normally
we care more about real compatibility problems instead of abstract
promises.

-Andi 



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-22  7:09 ` [PATCH 00/18] multiple preferred nodes Michal Hocko
@ 2020-06-23 11:20   ` Michal Hocko
  2020-06-23 16:12     ` Ben Widawsky
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2020-06-23 11:20 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Mon 22-06-20 09:10:00, Michal Hocko wrote:
[...]
> > The goal of the new mode is to enable some use-cases when using tiered memory
> > usage models which I've lovingly named.
> > 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> > requirements allowing preference to be given to all nodes with "fast" memory.
> > 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> > perhaps slow memory), but doesn't care which node it runs on. The application
> > can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> > etc). This reverses the nodes are chosen today where the kernel attempts to use
> > local memory to the CPU whenever possible. This will attempt to use the local
> > accelerator to the memory.
> > 2. The Tortoise - The administrator (or the application itself) is aware it only
> > needs slow memory, and so can prefer that.
> >
> > Much of this is almost achievable with the bind interface, but the bind
> > interface suffers from an inability to fallback to another set of nodes if
> > binding fails to all nodes in the nodemask.

Yes, and probably worth mentioning explicitly that this might lead to
the OOM killer invocation so a failure would be disruptive to any
workload which is allowed to allocate from the specific node mask (so
even tasks without any mempolicy).

> > Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
> > preference.
> > 
> > > /* Set first two nodes as preferred in an 8 node system. */
> > > const unsigned long nodes = 0x3
> > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > 
> > > /* Mimic interleave policy, but have fallback *.
> > > const unsigned long nodes = 0xaa
> > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > 
> > Some internal discussion took place around the interface. There are two
> > alternatives which we have discussed, plus one I stuck in:
> > 1. Ordered list of nodes. Currently it's believed that the added complexity is
> >    nod needed for expected usecases.

There is no ordering in MPOL_BIND either and even though numa apis tend
to be screwed up from multiple aspects this is not a problem I have ever
stumbled over.

> > 2. A flag for bind to allow falling back to other nodes. This confuses the
> >    notion of binding and is less flexible than the current solution.

Agreed.

> > 3. Create flags or new modes that helps with some ordering. This offers both a
> >    friendlier API as well as a solution for more customized usage. It's unknown
> >    if it's worth the complexity to support this. Here is sample code for how
> >    this might work:
> > 
> > > // Default
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > > // which is the same as
> > > set_mempolicy(MPOL_DEFAULT, NULL, 0);

OK

> > > // The Hare
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> > >
> > > // The Tortoise
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> > >
> > > // Prefer the fast memory of the first two sockets
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> > >
> > > // Prefer specific nodes for some something wacky
> > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);

I am not so sure about these though. It would be much easier to
start without additional modifiers and provide MPOL_PREFER_MANY without
any additional restrictions first (btw. I would like MPOL_PREFER_MASK
more, but I do understand that naming is not the top priority now).

It would also be great to provide a high-level semantic description
here. I have only very quickly glanced through the patches and they are
not really trivial to follow with so many incremental steps, so the
higher-level intention is easily lost.

Do I get it right that the default semantic is essentially
	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
	  semantic)
	- fallback to numa unrestricted allocation with the default
	  numa policy on the failure

Or are there any usecases to modify how hard to keep the preference over
the fallback?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-23 11:20   ` Michal Hocko
@ 2020-06-23 16:12     ` Ben Widawsky
  2020-06-24  7:52       ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-23 16:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-23 13:20:48, Michal Hocko wrote:
> On Mon 22-06-20 09:10:00, Michal Hocko wrote:
> [...]
> > > The goal of the new mode is to enable some use-cases when using tiered memory
> > > usage models which I've lovingly named.
> > > 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency
> > > requirements allowing preference to be given to all nodes with "fast" memory.
> > > 1b. The Indiscriminate Hare - An application knows it wants fast memory (or
> > > perhaps slow memory), but doesn't care which node it runs on. The application
> > > can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator,
> > > etc). This reverses the nodes are chosen today where the kernel attempts to use
> > > local memory to the CPU whenever possible. This will attempt to use the local
> > > accelerator to the memory.
> > > 2. The Tortoise - The administrator (or the application itself) is aware it only
> > > needs slow memory, and so can prefer that.
> > >
> > > Much of this is almost achievable with the bind interface, but the bind
> > > interface suffers from an inability to fallback to another set of nodes if
> > > binding fails to all nodes in the nodemask.
> 
> Yes, and probably worth mentioning explicitly that this might lead to
> the OOM killer invocation so a failure would be disruptive to any
> workload which is allowed to allocate from the specific node mask (so
> even tasks without any mempolicy).

Thanks. I don't believe I mention this fact in any of the commit messages or
comments (and perhaps this is an indication I should have). I'll find a place to
mention this outside of the cover letter.

> 
> > > Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the
> > > preference.
> > > 
> > > > /* Set first two nodes as preferred in an 8 node system. */
> > > > const unsigned long nodes = 0x3
> > > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > > 
> > > > /* Mimic interleave policy, but have fallback *.
> > > > const unsigned long nodes = 0xaa
> > > > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);
> > > 
> > > Some internal discussion took place around the interface. There are two
> > > alternatives which we have discussed, plus one I stuck in:
> > > 1. Ordered list of nodes. Currently it's believed that the added complexity is
> > >    nod needed for expected usecases.
> 
> There is no ordering in MPOL_BIND either and even though numa apis tend
> to be screwed up from multiple aspects this is not a problem I have ever
> stumbled over.
> 
> > > 2. A flag for bind to allow falling back to other nodes. This confuses the
> > >    notion of binding and is less flexible than the current solution.
> 
> Agreed.
> 
> > > 3. Create flags or new modes that helps with some ordering. This offers both a
> > >    friendlier API as well as a solution for more customized usage. It's unknown
> > >    if it's worth the complexity to support this. Here is sample code for how
> > >    this might work:
> > > 
> > > > // Default
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
> > > > // which is the same as
> > > > set_mempolicy(MPOL_DEFAULT, NULL, 0);
> 
> OK
> 
> > > > // The Hare
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
> > > >
> > > > // The Tortoise
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
> > > >
> > > > // Prefer the fast memory of the first two sockets
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
> > > >
> > > > // Prefer specific nodes for some something wacky
> > > > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_CUSTOM, 0x17c, 1024);
> 
> I am not so sure about these though. It would be much more easier to
> start without additional modifiers and provide MPOL_PREFER_MANY without
> any additional restrictions first (btw. I would like MPOL_PREFER_MASK
> more but I do understand that naming is not the top priority now).

True. In fact, this is the same as making MPOL_F_PREFER_ORDER_TYPE_CUSTOM the
implicit default, and adding the others later. Luckily for me, this is
effectively what I have already :-).

It's a new domain for me, so I'm very flexible on the name. MASK seems like an
altogether better name to me as well, but I've been using "MANY" long enough now
that it seems natural.

> 
> It would be also great to provide a high level semantic description
> here. I have very quickly glanced through patches and they are not
> really trivial to follow with many incremental steps so the higher level
> intention is lost easily.
> 
> Do I get it right that the default semantic is essentially
> 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> 	  semantic)
> 	- fallback to numa unrestricted allocation with the default
> 	  numa policy on the failure
> 
> Or are there any usecases to modify how hard to keep the preference over
> the fallback?

tl;dr is: yes, and no usecases.

Longer answer:
Internal APIs (specifically, __alloc_pages_nodemask()) keep all the same
allocation semantics, with the exception that the preferred nodes are
tried first and the bound nodes next. It should be noted here that an
empty preferred mask is the same as saying "traverse nodes in distance
order starting from local". Therefore, for both the preferred mask and
the bound mask, the universe set is equivalent to the empty set (∅ == U). [1]

| prefmask | bindmask | how                                    |
|----------|----------|----------------------------------------|
| ∅        | ∅        | Page allocation without policy         |
| ∅        | N ≠ ∅    | MPOL_BIND                              |
| N ≠ ∅    | ∅        | MPOL_PREFERRED* or internal preference |
| N ≠ ∅    | N ≠ ∅    | MPOL_BIND + internal preference        |
|----------|----------|----------------------------------------|
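
To make the ∅ == U convention concrete, here is a minimal sketch of the
normalization (the helper name below is made up purely for illustration;
it is not something the series adds):

```
/*
 * Illustration only: treat an empty (or missing) preferred mask as
 * "no restriction", i.e. all nodes with memory, matching the ∅ == U
 * rule described above.
 */
static inline const nodemask_t *prefmask_or_all(const nodemask_t *prefmask)
{
	if (!prefmask || nodes_empty(*prefmask))
		return &node_states[N_MEMORY];	/* the universe set */
	return prefmask;
}
```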

At the end of this patch series, there is never a case (that I can contrive
anyway) where prefmask contains multiple nodes and bindmask contains multiple
nodes at the same time. In the future, if internal callers wanted to try to get
clever, this could be the case. The UAPI won't allow setting both a bind and a
preferred nodemask: "This system call defines the default policy for the
thread. The thread policy governs allocation of pages in the process's address
space outside of memory ranges controlled by a more specific policy set by
mbind(2)."

To your second question: there isn't any usecase. Sans bugs and oversights,
preferred nodes are always tried before fallback. I consider that almost the
hardest level of preference. The one thing I can think of that would be "harder"
would be some sort of mechanism to try all preferred nodes before any tricks,
like reclaim, are used. I fear doing this would make the already scary
get_page_from_freelist() even scarier.

On this topic, I haven't changed anything for fragmentation. In the code right
now, fragmenting fallbacks are allowed as soon as the zone chosen for
allocation doesn't match preferred_zoneref->zone:

```
if (no_fallback && nr_online_nodes > 1 &&
		zone != ac->preferred_zoneref->zone) {
```

What might be more optimal is to move on to the next node and not allow
fragmentation yet, unless zone ∉ prefmask. Like the above, I think this would
add a decent amount of complexity.
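
A rough sketch of that check, assuming prefmask were plumbed into the
allocation context (which it is not today):

```
/*
 * Sketch only: keep ALLOC_NOFRAGMENT in effect while the zone still
 * belongs to a preferred node; only allow fragmenting fallbacks once
 * the iteration has left prefmask entirely.
 */
if (no_fallback && nr_online_nodes > 1 &&
		zone != ac->preferred_zoneref->zone &&
		!node_isset(zone_to_nid(zone), *prefmask)) {
	alloc_flags &= ~ALLOC_NOFRAGMENT;
	goto retry;
}
```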

The last thing, which I mention in a commit message but not here: the OOM
killer will scan all nodes rather than trying the preferred nodes first.
Restricting it to the preferred nodes first seemed like a premature
optimization to me.


[1] There is an underlying assumption that the geodesic distance between any two
nodes is the same for all zonelists. IOW, if you have nodes M, N, P, each with
zones A and B, the zonelists will be as follows:
M zonelist: MA -> MB -> NA -> NB -> PA -> PB
N zonelist: NA -> NB -> PA -> PB -> MA -> MB
P zonelist: PA -> PB -> MA -> MB -> NA -> NB


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-23 16:12     ` Ben Widawsky
@ 2020-06-24  7:52       ` Michal Hocko
  2020-06-24 16:16         ` Ben Widawsky
  2020-06-26 21:39         ` Ben Widawsky
  0 siblings, 2 replies; 44+ messages in thread
From: Michal Hocko @ 2020-06-24  7:52 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> On 20-06-23 13:20:48, Michal Hocko wrote:
[...]
> > It would be also great to provide a high level semantic description
> > here. I have very quickly glanced through patches and they are not
> > really trivial to follow with many incremental steps so the higher level
> > intention is lost easily.
> > 
> > Do I get it right that the default semantic is essentially
> > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > 	  semantic)
> > 	- fallback to numa unrestricted allocation with the default
> > 	  numa policy on the failure
> > 
> > Or are there any usecases to modify how hard to keep the preference over
> > the fallback?
> 
> tl;dr is: yes, and no usecases.

OK, then I am wondering why the change has to be so involved. Except for
syscall plumbing the only real change to the allocator path would be
something like

static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
{
	/* Lower zones don't get a nodemask applied for MPOL_BIND */
	if (unlikely(policy->mode == MPOL_BIND || 
	   	     policy->mode == MPOL_PREFERED_MANY) &&
			apply_policy_zone(policy, gfp_zone(gfp)) &&
			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
		return &policy->v.nodes;

	return NULL;
}

alloc_pages_current

	if (pol->mode == MPOL_INTERLEAVE)
		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
	else {
		gfp_t gfp_attempt = gfp;

		/*
		 * Make sure the first allocation attempt will try hard
		 * but eventually fail without OOM killer or other
		 * disruption before falling back to the full nodemask
		 */
		if (pol->mode == MPOL_PREFERED_MANY)
			gfp_attempt |= __GFP_RETRY_MAYFAIL;	

		page = __alloc_pages_nodemask(gfp_attempt, order,
				policy_node(gfp, pol, numa_node_id()),
				policy_nodemask(gfp, pol));
		if (!page && pol->mode == MPOL_PREFERED_MANY)
			page = __alloc_pages_nodemask(gfp, order,
				numa_node_id(), NULL);
	}

	return page;

similar (well slightly more hairy) in alloc_pages_vma

Or do I miss something that really requires more involved approach like
building custom zonelists and other larger changes to the allocator?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 01/18] mm/mempolicy: Add comment for missing LOCAL
  2020-06-19 16:24 ` [PATCH 01/18] mm/mempolicy: Add comment for missing LOCAL Ben Widawsky
@ 2020-06-24  7:55   ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2020-06-24  7:55 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: linux-mm, Christoph Lameter, Andrew Morton, David Rientjes

On Fri 19-06-20 09:24:08, Ben Widawsky wrote:
> MPOL_LOCAL is a bit weird because it is simply a different name for an
> existing behavior (preferred policy with no node mask). It has been this
> way since it was added here:
> commit 479e2802d09f ("mm: mempolicy: Make MPOL_LOCAL a real policy")
> 
> It is so similar to MPOL_PREFERRED in fact that when the policy is
> created in mpol_new, the mode is set as PREFERRED, and an internal state
> representing LOCAL doesn't exist.
> 
> To prevent future explorers from scratching their head as to why
> MPOL_LOCAL isn't defined in the mpol_ops table, add a small comment
> explaining the situations.

Agreed. MPOL_LOCAL can be really confusing for whoever looks at the code
the first time.
 
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Rientjes <rientjes@google.com>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/mempolicy.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 381320671677..36ee3267c25f 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -427,6 +427,7 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.create = mpol_new_bind,
>  		.rebind = mpol_rebind_nodemask,
>  	},
> +	/* MPOL_LOCAL is converted to MPOL_PREFERRED on policy creation */

I would just add "See mpol_new()" to it.

>  };
>  
>  static int migrate_page_add(struct page *page, struct list_head *pagelist,
> -- 
> 2.27.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id()
  2020-06-19 16:24 ` [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id() Ben Widawsky
@ 2020-06-24  8:25   ` Michal Hocko
  2020-06-24 16:48     ` Ben Widawsky
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2020-06-24  8:25 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: linux-mm, Andrew Morton, Lee Schermerhorn

On Fri 19-06-20 09:24:09, Ben Widawsky wrote:
> Calling out some distinctions first as I understand it, and the
> reasoning of the patch:
> numa_node_id() - The node id for the currently running CPU.
> numa_mem_id() - The node id for the closest memory node.

Correct

> The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
> Only ia64 and powerpc support this option, so it is perhaps not a very
> interesting situation to most.

Other arches can have nodes without any memory as well. Just offline all
the managed memory via hotplug... (please note that such a node might
still have memory present! Just not usable by the page allocator)

> The question is, when you do want which?

Short answer is that you shouldn't really care. The fact that we do, and
that we have a distinct API for it, is IMHO a mistake of the past.

A slightly longer answer would be that the allocator should fall back to
a proper node (or nodes) even if you specify a memoryless node as the
preferred one. That is achieved by a proper zonelist for each existing
NUMA node. There are bugs here and there when some nodes do not get their
zonelists initialized, but fundamentally this should be a no-brainer.
There are also corner cases where an application might have been bound to
a node which went offline completely, which would be "fun".

> numa_node_id() is definitely
> what's desired if MPOL_PREFERRED, or MPOL_LOCAL were used, since the ABI
> states "This mode specifies "local allocation"; the memory is allocated
> on the node of the CPU that triggered the allocation (the "local
> node")."

In fact, from the allocator's point of view there is no real difference,
because there is obviously nothing to allocate from a node without
memory, so it would fall back to the next node/zones from the closest
node...

> It would be weird, though not impossible to set this policy on
> a CPU that has memoryless nodes.

Keep in mind that the memory might have gone away via hotplug.

> A more likely way to hit this is with
> interleaving. The current interfaces will return some equally weird
> thing, but at least it's symmetric. Therefore, in cases where the node
> is being queried for the currently running process, it probably makes
> sense to use numa_node_id(). For other cases however, when CPU is trying
> to obtain the "local" memory, numa_mem_id() already contains this and
> should be used instead.
> 
> This really should only effect configurations where
> CONFIG_HAVE_MEMORYLESS_NODES=y, and even on those machines it's quite
> possible the ultimate behavior would be identical.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>

Well, mempolicy.c uses numa_node_id in most cases and I do not see these
two callers being special. So if we want to change that, it should be done
consistently. I even suspect that these changes are mostly nops because
the respective zonelists will do the right thing. But there might be land
mines here and there - e.g. if __GFP_THISNODE was used then somebody
might expect a failure rather than a misplaced allocation, because there
is another fallback mechanism on a depleted NUMA node.

All that being said I am not sure this is an actual improvement.

> ---
>  mm/mempolicy.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 36ee3267c25f..99e0f3f9c4a6 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1991,7 +1991,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
>  	int nid;
>  
>  	if (!nnodes)
> -		return numa_node_id();
> +		return numa_mem_id();
>  	target = (unsigned int)n % nnodes;
>  	nid = first_node(pol->v.nodes);
>  	for (i = 0; i < target; i++)
> @@ -2049,7 +2049,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
>  		nid = interleave_nid(*mpol, vma, addr,
>  					huge_page_shift(hstate_vma(vma)));
>  	} else {
> -		nid = policy_node(gfp_flags, *mpol, numa_node_id());
> +		nid = policy_node(gfp_flags, *mpol, numa_mem_id());
>  		if ((*mpol)->mode == MPOL_BIND)
>  			*nodemask = &(*mpol)->v.nodes;
>  	}
> -- 
> 2.27.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24  7:52       ` Michal Hocko
@ 2020-06-24 16:16         ` Ben Widawsky
  2020-06-24 18:39           ` Michal Hocko
  2020-06-26 21:39         ` Ben Widawsky
  1 sibling, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-24 16:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 09:52:16, Michal Hocko wrote:
> On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > On 20-06-23 13:20:48, Michal Hocko wrote:
> [...]
> > > It would be also great to provide a high level semantic description
> > > here. I have very quickly glanced through patches and they are not
> > > really trivial to follow with many incremental steps so the higher level
> > > intention is lost easily.
> > > 
> > > Do I get it right that the default semantic is essentially
> > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > 	  semantic)
> > > 	- fallback to numa unrestricted allocation with the default
> > > 	  numa policy on the failure
> > > 
> > > Or are there any usecases to modify how hard to keep the preference over
> > > the fallback?
> > 
> > tl;dr is: yes, and no usecases.
> 
> OK, then I am wondering why the change has to be so involved. Except for
> syscall plumbing the only real change to the allocator path would be
> something like
> 
> static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> {
> 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> 	if (unlikely(policy->mode == MPOL_BIND || 
> 	   	     policy->mode == MPOL_PREFERED_MANY) &&
> 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> 		return &policy->v.nodes;
> 
> 	return NULL;
> }
> 
> alloc_pages_current
> 
> 	if (pol->mode == MPOL_INTERLEAVE)
> 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> 	else {
> 		gfp_t gfp_attempt = gfp;
> 
> 		/*
> 		 * Make sure the first allocation attempt will try hard
> 		 * but eventually fail without OOM killer or other
> 		 * disruption before falling back to the full nodemask
> 		 */
> 		if (pol->mode == MPOL_PREFERED_MANY)
> 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> 
> 		page = __alloc_pages_nodemask(gfp_attempt, order,
> 				policy_node(gfp, pol, numa_node_id()),
> 				policy_nodemask(gfp, pol));
> 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> 			page = __alloc_pages_nodemask(gfp, order,
> 				numa_node_id(), NULL);
> 	}
> 
> 	return page;
> 
> similar (well slightly more hairy) in alloc_pages_vma
> 
> Or do I miss something that really requires more involved approach like
> building custom zonelists and other larger changes to the allocator?

I think I'm missing how this allows selecting from multiple preferred nodes. In
this case when you try to get the page from the freelist, you'll get the
zonelist of the preferred node, and when you actually scan through on page
allocation, you have no way to filter out the non-preferred nodes. I think the
plumbing of multiple nodes has to go all the way through
__alloc_pages_nodemask(). But it's possible I've missed the point.

I do have a branch where I build a custom zonelist, but that's not the reason
here :-)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id()
  2020-06-24  8:25   ` Michal Hocko
@ 2020-06-24 16:48     ` Ben Widawsky
  2020-06-26 12:30       ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-24 16:48 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Andrew Morton, Lee Schermerhorn

On 20-06-24 10:25:59, Michal Hocko wrote:
> On Fri 19-06-20 09:24:09, Ben Widawsky wrote:
> > Calling out some distinctions first as I understand it, and the
> > reasoning of the patch:
> > numa_node_id() - The node id for the currently running CPU.
> > numa_mem_id() - The node id for the closest memory node.
> 
> Correct
> 
> > The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
> > Only ia64 and powerpc support this option, so it is perhaps not a very
> > interesting situation to most.
> 
> Other arches can have nodes without any memory as well. Just offline all
> the managed memory via hotplug... (please note that such node still
> might have memory present! Just not useable by the page allocator)

You must have CONFIG_HAVE_MEMORYLESS_NODES defined. So I believe that this
change is limited to ia64 and powerpc. I don't think there is a way to set it
outside of those arches.

> 
> > The question is, when you do want which?
> 
> Short answer is that you shouldn't really care. The fact that we do and
> that we have a distinct API for that is IMHO a mistake of the past IMHO.
> 
> A slightly longer answer would be that the allocator should fallback to
> a proper node(s) even if you specify a memory less node as a preferred
> one. That is achieved by proper zonelist for each existing NUMA node.
> There are bugs here and there when some nodes do not get their zonelists
> initialized but fundamentally this should be nobrainer. There are also
> corner cases where an application might have been bound to a node which
> went offline completely which would be "fun"

I'd be willing to try to review a patch that removes it. Generally, I think
you're correct and zonelists solve what this attempts to solve. Maybe you need
something like this if a CPU has no memory:
zone_to_nid(first_zones_zonelist(NODE_DATA(nid)->node_zonelists)).
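
Roughly like this (just a sketch; the exact highest_zoneidx and the error
handling are hand-waved):

```
/*
 * Sketch: resolve a possibly memoryless node to the nearest node that
 * actually has memory by looking at the first entry of its zonelist.
 */
static int nearest_memory_node(int nid)
{
	struct zoneref *z;

	z = first_zones_zonelist(node_zonelist(nid, GFP_KERNEL),
				 ZONE_NORMAL, NULL);
	return z->zone ? zone_to_nid(z->zone) : nid;
}
```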

> 
> > numa_node_id() is definitely
> > what's desired if MPOL_PREFERRED, or MPOL_LOCAL were used, since the ABI
> > states "This mode specifies "local allocation"; the memory is allocated
> > on the node of the CPU that triggered the allocation (the "local
> > node")."
> 
> In fact from the allocator point of view there is no real difference
> because there is nothing to allocate from the node without memory,
> obviously so it would fallback to the next node/zones from the closest
> node...
> 
> > It would be weird, though not impossible to set this policy on
> > a CPU that has memoryless nodes.
> 
> Keep in mind that the memory might went away via hotplug.
> 
> > A more likely way to hit this is with
> > interleaving. The current interfaces will return some equally weird
> > thing, but at least it's symmetric. Therefore, in cases where the node
> > is being queried for the currently running process, it probably makes
> > sense to use numa_node_id(). For other cases however, when CPU is trying
> > to obtain the "local" memory, numa_mem_id() already contains this and
> > should be used instead.
> > 
> > This really should only effect configurations where
> > CONFIG_HAVE_MEMORYLESS_NODES=y, and even on those machines it's quite
> > possible the ultimate behavior would be identical.
> > 
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
> > Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
> 
> Well, mempolicy.c uses numa_node_id in most cases and I do not see these
> two being special. So if we want to change that it should be done
> consistently. I even suspect that these changes are mostly nops because
> respective zonelists will do the right thing, But there might be land
> mines here and there - e.g. if __GFP_THISNODE was used then somebody
> might expect a failure rather than a misplaced allocation because there
> is other fallback mechanism on a depleted numa node.
> 
> All that being said I am not sure this is an actual improvement.

It wasn't entirely arbitrary. I tried to cherry-pick the cases where the local
memory node was wanted as opposed to the local NUMA node. As I went through, it
seemed that in some cases this is correct: for instance, getting the node for
"LOCAL" should probably return the closest node, not the closest memory node,
but getting the next closest interleave node should be a memory node. It was
definitely not perfect.

The only reason I added this patch is that I use numa_mem_id by the end of
the series, and I wanted to make sure it was correct to add new usages of
numa_mem_id (because, like you said, mempolicy mostly uses numa_node_id).
I'm happy to just keep using numa_node_id if that's the right thing to do.

> 
> > ---
> >  mm/mempolicy.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 36ee3267c25f..99e0f3f9c4a6 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1991,7 +1991,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
> >  	int nid;
> >  
> >  	if (!nnodes)
> > -		return numa_node_id();
> > +		return numa_mem_id();
> >  	target = (unsigned int)n % nnodes;
> >  	nid = first_node(pol->v.nodes);
> >  	for (i = 0; i < target; i++)
> > @@ -2049,7 +2049,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
> >  		nid = interleave_nid(*mpol, vma, addr,
> >  					huge_page_shift(hstate_vma(vma)));
> >  	} else {
> > -		nid = policy_node(gfp_flags, *mpol, numa_node_id());
> > +		nid = policy_node(gfp_flags, *mpol, numa_mem_id());
> >  		if ((*mpol)->mode == MPOL_BIND)
> >  			*nodemask = &(*mpol)->v.nodes;
> >  	}
> > -- 
> > 2.27.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 16:16         ` Ben Widawsky
@ 2020-06-24 18:39           ` Michal Hocko
  2020-06-24 19:37             ` Ben Widawsky
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2020-06-24 18:39 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> On 20-06-24 09:52:16, Michal Hocko wrote:
> > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > [...]
> > > > It would be also great to provide a high level semantic description
> > > > here. I have very quickly glanced through patches and they are not
> > > > really trivial to follow with many incremental steps so the higher level
> > > > intention is lost easily.
> > > > 
> > > > Do I get it right that the default semantic is essentially
> > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > 	  semantic)
> > > > 	- fallback to numa unrestricted allocation with the default
> > > > 	  numa policy on the failure
> > > > 
> > > > Or are there any usecases to modify how hard to keep the preference over
> > > > the fallback?
> > > 
> > > tl;dr is: yes, and no usecases.
> > 
> > OK, then I am wondering why the change has to be so involved. Except for
> > syscall plumbing the only real change to the allocator path would be
> > something like
> > 
> > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > {
> > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > 	if (unlikely(policy->mode == MPOL_BIND || 
> > 	   	     policy->mode == MPOL_PREFERED_MANY) &&
> > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > 		return &policy->v.nodes;
> > 
> > 	return NULL;
> > }
> > 
> > alloc_pages_current
> > 
> > 	if (pol->mode == MPOL_INTERLEAVE)
> > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > 	else {
> > 		gfp_t gfp_attempt = gfp;
> > 
> > 		/*
> > 		 * Make sure the first allocation attempt will try hard
> > 		 * but eventually fail without OOM killer or other
> > 		 * disruption before falling back to the full nodemask
> > 		 */
> > 		if (pol->mode == MPOL_PREFERED_MANY)
> > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > 
> > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > 				policy_node(gfp, pol, numa_node_id()),
> > 				policy_nodemask(gfp, pol));
> > 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> > 			page = __alloc_pages_nodemask(gfp, order,
> > 				numa_node_id(), NULL);
> > 	}
> > 
> > 	return page;
> > 
> > similar (well slightly more hairy) in alloc_pages_vma
> > 
> > Or do I miss something that really requires more involved approach like
> > building custom zonelists and other larger changes to the allocator?
> 
> I think I'm missing how this allows selecting from multiple preferred nodes. In
> this case when you try to get the page from the freelist, you'll get the
> zonelist of the preferred node, and when you actually scan through on page
> allocation, you have no way to filter out the non-preferred nodes. I think the
> plumbing of multiple nodes has to go all the way through
> __alloc_pages_nodemask(). But it's possible I've missed the point.

policy_nodemask() will provide the nodemask which will be used as a
filter on the policy_node.
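
For reference, the mechanism is simply the nodemask-aware zonelist walk in
get_page_from_freelist(), something along these lines (simplified, with
the details elided):

```
for_each_zone_zonelist_nodemask(zone, z, zonelist, highest_zoneidx, nodemask) {
	/*
	 * Zones whose node is not set in 'nodemask' are skipped by the
	 * iterator itself, so a multi-node preferred mask already
	 * restricts where the first allocation attempt can land.
	 * Watermark checks and the actual rmqueue() call are elided.
	 */
}
```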
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 18:39           ` Michal Hocko
@ 2020-06-24 19:37             ` Ben Widawsky
  2020-06-24 19:51               ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-24 19:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 20:39:17, Michal Hocko wrote:
> On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > [...]
> > > > > It would be also great to provide a high level semantic description
> > > > > here. I have very quickly glanced through patches and they are not
> > > > > really trivial to follow with many incremental steps so the higher level
> > > > > intention is lost easily.
> > > > > 
> > > > > Do I get it right that the default semantic is essentially
> > > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > 	  semantic)
> > > > > 	- fallback to numa unrestricted allocation with the default
> > > > > 	  numa policy on the failure
> > > > > 
> > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > the fallback?
> > > > 
> > > > tl;dr is: yes, and no usecases.
> > > 
> > > OK, then I am wondering why the change has to be so involved. Except for
> > > syscall plumbing the only real change to the allocator path would be
> > > something like
> > > 
> > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > {
> > > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > 	if (unlikely(policy->mode == MPOL_BIND || 
> > > 	   	     policy->mode == MPOL_PREFERED_MANY) &&
> > > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > 		return &policy->v.nodes;
> > > 
> > > 	return NULL;
> > > }
> > > 
> > > alloc_pages_current
> > > 
> > > 	if (pol->mode == MPOL_INTERLEAVE)
> > > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > 	else {
> > > 		gfp_t gfp_attempt = gfp;
> > > 
> > > 		/*
> > > 		 * Make sure the first allocation attempt will try hard
> > > 		 * but eventually fail without OOM killer or other
> > > 		 * disruption before falling back to the full nodemask
> > > 		 */
> > > 		if (pol->mode == MPOL_PREFERED_MANY)
> > > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > > 
> > > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > > 				policy_node(gfp, pol, numa_node_id()),
> > > 				policy_nodemask(gfp, pol));
> > > 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> > > 			page = __alloc_pages_nodemask(gfp, order,
> > > 				numa_node_id(), NULL);
> > > 	}
> > > 
> > > 	return page;
> > > 
> > > similar (well slightly more hairy) in alloc_pages_vma
> > > 
> > > Or do I miss something that really requires more involved approach like
> > > building custom zonelists and other larger changes to the allocator?
> > 
> > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > this case when you try to get the page from the freelist, you'll get the
> > zonelist of the preferred node, and when you actually scan through on page
> > allocation, you have no way to filter out the non-preferred nodes. I think the
> > plumbing of multiple nodes has to go all the way through
> > __alloc_pages_nodemask(). But it's possible I've missed the point.
> 
> policy_nodemask() will provide the nodemask which will be used as a
> filter on the policy_node.

Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
to that point. The UAPI cannot set independent masks, and callers of these
functions don't yet use them.

So let me ask before I actually type it up and find it's much, much simpler: is
there not some perceived benefit to having the two masks be independent?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 19:37             ` Ben Widawsky
@ 2020-06-24 19:51               ` Michal Hocko
  2020-06-24 20:01                 ` Ben Widawsky
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2020-06-24 19:51 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> On 20-06-24 20:39:17, Michal Hocko wrote:
> > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > > [...]
> > > > > > It would be also great to provide a high level semantic description
> > > > > > here. I have very quickly glanced through patches and they are not
> > > > > > really trivial to follow with many incremental steps so the higher level
> > > > > > intention is lost easily.
> > > > > > 
> > > > > > Do I get it right that the default semantic is essentially
> > > > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > > 	  semantic)
> > > > > > 	- fallback to numa unrestricted allocation with the default
> > > > > > 	  numa policy on the failure
> > > > > > 
> > > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > > the fallback?
> > > > > 
> > > > > tl;dr is: yes, and no usecases.
> > > > 
> > > > OK, then I am wondering why the change has to be so involved. Except for
> > > > syscall plumbing the only real change to the allocator path would be
> > > > something like
> > > > 
> > > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > {
> > > > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > > 	if (unlikely(policy->mode == MPOL_BIND || 
> > > > 	   	     policy->mode == MPOL_PREFERED_MANY) &&
> > > > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > > 		return &policy->v.nodes;
> > > > 
> > > > 	return NULL;
> > > > }
> > > > 
> > > > alloc_pages_current
> > > > 
> > > > 	if (pol->mode == MPOL_INTERLEAVE)
> > > > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > > 	else {
> > > > 		gfp_t gfp_attempt = gfp;
> > > > 
> > > > 		/*
> > > > 		 * Make sure the first allocation attempt will try hard
> > > > 		 * but eventually fail without OOM killer or other
> > > > 		 * disruption before falling back to the full nodemask
> > > > 		 */
> > > > 		if (pol->mode == MPOL_PREFERED_MANY)
> > > > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > > > 
> > > > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > > > 				policy_node(gfp, pol, numa_node_id()),
> > > > 				policy_nodemask(gfp, pol));
> > > > 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> > > > 			page = __alloc_pages_nodemask(gfp, order,
> > > > 				numa_node_id(), NULL);
> > > > 	}
> > > > 
> > > > 	return page;
> > > > 
> > > > similar (well slightly more hairy) in alloc_pages_vma
> > > > 
> > > > Or do I miss something that really requires more involved approach like
> > > > building custom zonelists and other larger changes to the allocator?
> > > 
> > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > this case when you try to get the page from the freelist, you'll get the
> > > zonelist of the preferred node, and when you actually scan through on page
> > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > plumbing of multiple nodes has to go all the way through
> > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > 
> > policy_nodemask() will provide the nodemask which will be used as a
> > filter on the policy_node.
> 
> Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> to that point. UAPI cannot get independent masks, and callers of these functions
> don't yet use them.
> 
> So let me ask before I actually type it up and find it's much much simpler, is
> there not some perceived benefit to having both masks being independent?

I am not sure I follow. Which two masks do you have in mind? zonelist
and user provided nodemask?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 19:51               ` Michal Hocko
@ 2020-06-24 20:01                 ` Ben Widawsky
  2020-06-24 20:07                   ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-24 20:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 21:51:58, Michal Hocko wrote:
> On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > > On 20-06-24 09:52:16, Michal Hocko wrote:
> > > > > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > > > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > > > > [...]
> > > > > > > It would be also great to provide a high level semantic description
> > > > > > > here. I have very quickly glanced through patches and they are not
> > > > > > > really trivial to follow with many incremental steps so the higher level
> > > > > > > intention is lost easily.
> > > > > > > 
> > > > > > > Do I get it right that the default semantic is essentially
> > > > > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > > > > 	  semantic)
> > > > > > > 	- fallback to numa unrestricted allocation with the default
> > > > > > > 	  numa policy on the failure
> > > > > > > 
> > > > > > > Or are there any usecases to modify how hard to keep the preference over
> > > > > > > the fallback?
> > > > > > 
> > > > > > tl;dr is: yes, and no usecases.
> > > > > 
> > > > > OK, then I am wondering why the change has to be so involved. Except for
> > > > > syscall plumbing the only real change to the allocator path would be
> > > > > something like
> > > > > 
> > > > > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > > > > {
> > > > > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > > > > 	if (unlikely(policy->mode == MPOL_BIND || 
> > > > > 	   	     policy->mode == MPOL_PREFERED_MANY) &&
> > > > > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > > > > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > > > > 		return &policy->v.nodes;
> > > > > 
> > > > > 	return NULL;
> > > > > }
> > > > > 
> > > > > alloc_pages_current
> > > > > 
> > > > > 	if (pol->mode == MPOL_INTERLEAVE)
> > > > > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > > > > 	else {
> > > > > 		gfp_t gfp_attempt = gfp;
> > > > > 
> > > > > 		/*
> > > > > 		 * Make sure the first allocation attempt will try hard
> > > > > 		 * but eventually fail without OOM killer or other
> > > > > 		 * disruption before falling back to the full nodemask
> > > > > 		 */
> > > > > 		if (pol->mode == MPOL_PREFERED_MANY)
> > > > > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > > > > 
> > > > > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > > > > 				policy_node(gfp, pol, numa_node_id()),
> > > > > 				policy_nodemask(gfp, pol));
> > > > > 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> > > > > 			page = __alloc_pages_nodemask(gfp, order,
> > > > > 				numa_node_id(), NULL);
> > > > > 	}
> > > > > 
> > > > > 	return page;
> > > > > 
> > > > > similar (well slightly more hairy) in alloc_pages_vma
> > > > > 
> > > > > Or do I miss something that really requires more involved approach like
> > > > > building custom zonelists and other larger changes to the allocator?
> > > > 
> > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > this case when you try to get the page from the freelist, you'll get the
> > > > zonelist of the preferred node, and when you actually scan through on page
> > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > plumbing of multiple nodes has to go all the way through
> > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > 
> > > policy_nodemask() will provide the nodemask which will be used as a
> > > filter on the policy_node.
> > 
> > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > to that point. UAPI cannot get independent masks, and callers of these functions
> > don't yet use them.
> > 
> > So let me ask before I actually type it up and find it's much much simpler, is
> > there not some perceived benefit to having both masks being independent?
> 
> I am not sure I follow. Which two masks do you have in mind? zonelist
> and user provided nodemask?

Internally, a nodemask_t for the preferred nodes, and a nodemask_t for the bound nodes.

> 
> -- 
> Michal Hocko
> SUSE Labs
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:01                 ` Ben Widawsky
@ 2020-06-24 20:07                   ` Michal Hocko
  2020-06-24 20:23                     ` Ben Widawsky
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2020-06-24 20:07 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> On 20-06-24 21:51:58, Michal Hocko wrote:
> > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
[...]
> > > > > > Or do I miss something that really requires more involved approach like
> > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > 
> > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > plumbing of multiple nodes has to go all the way through
> > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > 
> > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > filter on the policy_node.
> > > 
> > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > don't yet use them.
> > > 
> > > So let me ask before I actually type it up and find it's much much simpler, is
> > > there not some perceived benefit to having both masks being independent?
> > 
> > I am not sure I follow. Which two masks do you have in mind? zonelist
> > and user provided nodemask?
> 
> Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.

Each mask is local to its policy object.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:07                   ` Michal Hocko
@ 2020-06-24 20:23                     ` Ben Widawsky
  2020-06-24 20:42                       ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-24 20:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 22:07:50, Michal Hocko wrote:
> On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> [...]
> > > > > > > Or do I miss something that really requires more involved approach like
> > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > 
> > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > 
> > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > filter on the policy_node.
> > > > 
> > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > don't yet use them.
> > > > 
> > > > So let me ask before I actually type it up and find it's much much simpler, is
> > > > there not some perceived benefit to having both masks being independent?
> > > 
> > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > and user provided nodemask?
> > 
> > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> 
> Each mask is a local to its policy object.

I mean __alloc_pages_nodemask() as an internal API. That is irrespective of
policy; policy decisions are all made beforehand. The question from a few mails
ago was whether there is any use in keeping the change that has
__alloc_pages_nodemask() accept two nodemasks.
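
Concretely, the question is whether the internal entry point should keep a
shape roughly like this (hypothetical prototype, only to illustrate what "two
nodemasks" means; it is not necessarily what the series ends up with):

```
/*
 * Hypothetical: prefmask is tried first (try hard, but no OOM killer),
 * while bindmask remains the hard restriction any fallback must still
 * respect. Either may be NULL, meaning "no restriction".
 */
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
		       nodemask_t *prefmask, nodemask_t *bindmask);
```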


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:23                     ` Ben Widawsky
@ 2020-06-24 20:42                       ` Michal Hocko
  2020-06-24 20:55                         ` Ben Widawsky
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2020-06-24 20:42 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 13:23:44, Ben Widawsky wrote:
> On 20-06-24 22:07:50, Michal Hocko wrote:
> > On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > [...]
> > > > > > > > Or do I miss something that really requires more involved approach like
> > > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > > 
> > > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > > 
> > > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > > filter on the policy_node.
> > > > > 
> > > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > > don't yet use them.
> > > > > 
> > > > > So let me ask before I actually type it up and find it's much much simpler, is
> > > > > there not some perceived benefit to having both masks being independent?
> > > > 
> > > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > > and user provided nodemask?
> > > 
> > > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> > 
> > Each mask is a local to its policy object.
> 
> I mean for __alloc_pages_nodemask as an internal API. That is irrespective of
> policy. Policy decisions are all made beforehand. The question from a few mails
> ago was whether there is any use in keeping that change to
> __alloc_pages_nodemask accepting two nodemasks.

It is probably too late for me because I am still not following what you
mean. Maybe it would be better to provide pseudo code for what you have in
mind. Anyway, all that I am saying is that for the functionality you
propose, and _if_ the fallback strategy is fixed, then all you should need
is to use the preferred nodemask for __alloc_pages_nodemask and a
fallback allocation to the full set (NULL nodemask). So you first try what
the userspace prefers - __GFP_RETRY_MAYFAIL gives you the "try hard but
do not OOM if the memory is depleted" semantic - and the fallback
allocation goes all the way to the OOM killer on complete memory depletion.
So I do not see much point in a custom zonelist for the policy. Maybe as
a micro-optimization to save some branches here and there.

If you envision usecases which might want to control the fallback
allocation strategy, then this would get more complex because you
would need a sorted list of zones to try. But that would really require
some solid usecase, and it should build on top of a trivial
implementation, which really is BIND with a fallback.
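
From the application side that trivial variant would look something like
this (sketch only; the mode name comes from this proposal and the numeric
value below is just a placeholder, not from any released kernel header):

```
#include <stdio.h>
#include <numaif.h>			/* set_mempolicy() via libnuma headers */

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY	5	/* placeholder value for the proposed mode */
#endif

int main(void)
{
	unsigned long nodes = 0x3;	/* prefer nodes 0 and 1 */

	/*
	 * Allocations try nodes 0-1 hard (but without triggering the OOM
	 * killer) and only then fall back to the rest of the system.
	 */
	if (set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8))
		perror("set_mempolicy");	/* expected on kernels without the series */
	return 0;
}
```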

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:42                       ` Michal Hocko
@ 2020-06-24 20:55                         ` Ben Widawsky
  2020-06-25  6:28                           ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-24 20:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 22:42:32, Michal Hocko wrote:
> On Wed 24-06-20 13:23:44, Ben Widawsky wrote:
> > On 20-06-24 22:07:50, Michal Hocko wrote:
> > > On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > > > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > [...]
> > > > > > > > > Or do I miss something that really requires more involved approach like
> > > > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > > > 
> > > > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > > > 
> > > > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > > > filter on the policy_node.
> > > > > > 
> > > > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > > > don't yet use them.
> > > > > > 
> > > > > > So let me ask before I actually type it up and find it's much much simpler, is
> > > > > > there not some perceived benefit to having both masks being independent?
> > > > > 
> > > > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > > > and user provided nodemask?
> > > > 
> > > > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> > > 
> > > Each mask is a local to its policy object.
> > 
> > I mean for __alloc_pages_nodemask as an internal API. That is irrespective of
> > policy. Policy decisions are all made beforehand. The question from a few mails
> > ago was whether there is any use in keeping that change to
> > __alloc_pages_nodemask accepting two nodemasks.
> 
> It is probably too late for me because I am still not following you
> mean. Maybe it would be better to provide a pseudo code what you have in
> mind. Anyway all that I am saying is that for the functionality that you
> propose and _if_ the fallback strategy is fixed then all you should need
> is to use the preferred nodemask for the __alloc_pages_nodemask and a
> fallback allocation to the full (NULL nodemask). So you first try what
> the userspace prefers - __GFP_RETRY_MAYFAIL will give you try hard but
> do not OOM if the memory is depleted semantic and the fallback
> allocation goes all the way to OOM on the complete memory depletion.
> So I do not see much point in a custom zonelist for the policy. Maybe as
> a micro-optimization to save some branches here and there.
> 
> If you envision usecases which might want to control the fallback
> allocation strategy then this would get more complex because you
> would need a sorted list of zones to try but this would really require
> some solid usecase and it should build on top of a trivial
> implementation which really is BIND with the fallback.
> 

I will implement what you suggest; I think it's a good approach. Here is
what I mean, though:
-struct page *
-__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
-                                                       nodemask_t *nodemask);
+struct page *
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, nodemask_t *prefmask,
+		       nodemask_t *nodemask);

Is there any value in keeping two nodemasks as part of the interface?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24 20:55                         ` Ben Widawsky
@ 2020-06-25  6:28                           ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2020-06-25  6:28 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Wed 24-06-20 13:55:18, Ben Widawsky wrote:
> On 20-06-24 22:42:32, Michal Hocko wrote:
> > On Wed 24-06-20 13:23:44, Ben Widawsky wrote:
> > > On 20-06-24 22:07:50, Michal Hocko wrote:
> > > > On Wed 24-06-20 13:01:40, Ben Widawsky wrote:
> > > > > On 20-06-24 21:51:58, Michal Hocko wrote:
> > > > > > On Wed 24-06-20 12:37:33, Ben Widawsky wrote:
> > > > > > > On 20-06-24 20:39:17, Michal Hocko wrote:
> > > > > > > > On Wed 24-06-20 09:16:43, Ben Widawsky wrote:
> > > > [...]
> > > > > > > > > > Or do I miss something that really requires more involved approach like
> > > > > > > > > > building custom zonelists and other larger changes to the allocator?
> > > > > > > > > 
> > > > > > > > > I think I'm missing how this allows selecting from multiple preferred nodes. In
> > > > > > > > > this case when you try to get the page from the freelist, you'll get the
> > > > > > > > > zonelist of the preferred node, and when you actually scan through on page
> > > > > > > > > allocation, you have no way to filter out the non-preferred nodes. I think the
> > > > > > > > > plumbing of multiple nodes has to go all the way through
> > > > > > > > > __alloc_pages_nodemask(). But it's possible I've missed the point.
> > > > > > > > 
> > > > > > > > policy_nodemask() will provide the nodemask which will be used as a
> > > > > > > > filter on the policy_node.
> > > > > > > 
> > > > > > > Ah, gotcha. Enabling independent masks seemed useful. Some bad decisions got me
> > > > > > > to that point. UAPI cannot get independent masks, and callers of these functions
> > > > > > > don't yet use them.
> > > > > > > 
> > > > > > > So let me ask before I actually type it up and find it's much much simpler, is
> > > > > > > there not some perceived benefit to having both masks being independent?
> > > > > > 
> > > > > > I am not sure I follow. Which two masks do you have in mind? zonelist
> > > > > > and user provided nodemask?
> > > > > 
> > > > > Internally, a nodemask_t for preferred node, and a nodemask_t for bound nodes.
> > > > 
> > > > Each mask is a local to its policy object.
> > > 
> > > I mean for __alloc_pages_nodemask as an internal API. That is irrespective of
> > > policy. Policy decisions are all made beforehand. The question from a few mails
> > > ago was whether there is any use in keeping that change to
> > > __alloc_pages_nodemask accepting two nodemasks.
> > 
> > It is probably too late for me because I am still not following you
> > mean. Maybe it would be better to provide a pseudo code what you have in
> > mind. Anyway all that I am saying is that for the functionality that you
> > propose and _if_ the fallback strategy is fixed then all you should need
> > is to use the preferred nodemask for the __alloc_pages_nodemask and a
> > fallback allocation to the full (NULL nodemask). So you first try what
> > the userspace prefers - __GFP_RETRY_MAYFAIL will give you try hard but
> > do not OOM if the memory is depleted semantic and the fallback
> > allocation goes all the way to OOM on the complete memory depletion.
> > So I do not see much point in a custom zonelist for the policy. Maybe as
> > a micro-optimization to save some branches here and there.
> > 
> > If you envision usecases which might want to control the fallback
> > allocation strategy then this would get more complex because you
> > would need a sorted list of zones to try but this would really require
> > some solid usecase and it should build on top of a trivial
> > implementation which really is BIND with the fallback.
> > 
> 
> I will implement what you suggest. I think it's a good suggestion. Here is what
> I mean though:
> -struct page *
> -__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> -                                                       nodemask_t *nodemask);
> +struct page *
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, nodemask_t *prefmask,
> +		       nodemask_t *nodemask);
> 
> Is there any value in keeping two nodemasks as part of the interface?

I do not see any advantage. The first thing you would have to do is
either intersect the two or special-case the code to use one over the
other, and then you would need a clear criterion for how to do that.
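
Just to illustrate the point (the helper below is made up, not a
proposal; nodes_and()/nodes_empty() are the nodemask.h helpers): with
two input masks the allocator's first step would be something like

static void effective_alloc_nodemask(nodemask_t *out,
				     const nodemask_t *prefmask,
				     const nodemask_t *bindmask)
{
	/* Reduce the two inputs to a single effective mask. */
	nodes_and(*out, *prefmask, *bindmask);
	if (nodes_empty(*out))
		/* Empty intersection is exactly the unclear criterion. */
		*out = *bindmask;
}

and at that point the caller could just as well compute the single
resulting nodemask itself.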

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id()
  2020-06-24 16:48     ` Ben Widawsky
@ 2020-06-26 12:30       ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2020-06-26 12:30 UTC (permalink / raw)
  To: Ben Widawsky; +Cc: linux-mm, Andrew Morton, Lee Schermerhorn

On Wed 24-06-20 09:48:37, Ben Widawsky wrote:
> On 20-06-24 10:25:59, Michal Hocko wrote:
> > On Fri 19-06-20 09:24:09, Ben Widawsky wrote:
> > > Calling out some distinctions first as I understand it, and the
> > > reasoning of the patch:
> > > numa_node_id() - The node id for the currently running CPU.
> > > numa_mem_id() - The node id for the closest memory node.
> > 
> > Correct
> > 
> > > The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
> > > Only ia64 and powerpc support this option, so it is perhaps not a very
> > > interesting situation to most.
> > 
> > Other arches can have nodes without any memory as well. Just offline all
> > the managed memory via hotplug... (please note that such node still
> > might have memory present! Just not useable by the page allocator)
> 
> You must have CONFIG_HAVE_MEMORYLESS_NODES defined. So I believe that this
> change is limited to ia64 and powerpc. I don't think there is a way to set it
> outside of those arches.

I have tried to say that while other arches (like x86) do not have
CONFIG_HAVE_MEMORYLESS_NODES defined, they can still end up with a node
without any memory usable by the page allocator. Just use memory hotplug...
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-24  7:52       ` Michal Hocko
  2020-06-24 16:16         ` Ben Widawsky
@ 2020-06-26 21:39         ` Ben Widawsky
  2020-06-29 10:16           ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Ben Widawsky @ 2020-06-26 21:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On 20-06-24 09:52:16, Michal Hocko wrote:
> On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > On 20-06-23 13:20:48, Michal Hocko wrote:
> [...]
> > > It would be also great to provide a high level semantic description
> > > here. I have very quickly glanced through patches and they are not
> > > really trivial to follow with many incremental steps so the higher level
> > > intention is lost easily.
> > > 
> > > Do I get it right that the default semantic is essentially
> > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > 	  semantic)
> > > 	- fallback to numa unrestricted allocation with the default
> > > 	  numa policy on the failure
> > > 
> > > Or are there any usecases to modify how hard to keep the preference over
> > > the fallback?
> > 
> > tl;dr is: yes, and no usecases.
> 
> OK, then I am wondering why the change has to be so involved. Except for
> syscall plumbing the only real change to the allocator path would be
> something like
> 
> static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> {
> 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> 	if (unlikely(policy->mode == MPOL_BIND || 
> 	   	     policy->mode == MPOL_PREFERED_MANY) &&
> 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> 		return &policy->v.nodes;
> 
> 	return NULL;
> }
> 
> alloc_pages_current
> 
> 	if (pol->mode == MPOL_INTERLEAVE)
> 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> 	else {
> 		gfp_t gfp_attempt = gfp;
> 
> 		/*
> 		 * Make sure the first allocation attempt will try hard
> 		 * but eventually fail without OOM killer or other
> 		 * disruption before falling back to the full nodemask
> 		 */
> 		if (pol->mode == MPOL_PREFERED_MANY)
> 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> 
> 		page = __alloc_pages_nodemask(gfp_attempt, order,
> 				policy_node(gfp, pol, numa_node_id()),
> 				policy_nodemask(gfp, pol));
> 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> 			page = __alloc_pages_nodemask(gfp, order,
> 				numa_node_id(), NULL);
> 	}
> 
> 	return page;
> 
> similar (well slightly more hairy) in alloc_pages_vma
> 
> Or do I miss something that really requires more involved approach like
> building custom zonelists and other larger changes to the allocator?

Hi Michal,

I'm mostly done implementing this change. It looks good, and so far I think
it's functionally equivalent. One thing though: above you use NULL for the
fallback. That actually should not be NULL, because of the logic in
policy_node to restrict zones and obey cpusets. I've implemented it that
way, but I was hoping someone with a deeper understanding and more
experience could confirm that this is the correct thing to do.

Thanks.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 00/18] multiple preferred nodes
  2020-06-26 21:39         ` Ben Widawsky
@ 2020-06-29 10:16           ` Michal Hocko
  0 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2020-06-29 10:16 UTC (permalink / raw)
  To: Ben Widawsky
  Cc: linux-mm, Andi Kleen, Andrew Morton, Christoph Lameter,
	Dan Williams, Dave Hansen, David Hildenbrand, David Rientjes,
	Jason Gunthorpe, Johannes Weiner, Jonathan Corbet,
	Kuppuswamy Sathyanarayanan, Lee Schermerhorn, Li Xinhai,
	Mel Gorman, Mike Kravetz, Mina Almasry, Tejun Heo,
	Vlastimil Babka, linux-api

On Fri 26-06-20 14:39:05, Ben Widawsky wrote:
> On 20-06-24 09:52:16, Michal Hocko wrote:
> > On Tue 23-06-20 09:12:11, Ben Widawsky wrote:
> > > On 20-06-23 13:20:48, Michal Hocko wrote:
> > [...]
> > > > It would be also great to provide a high level semantic description
> > > > here. I have very quickly glanced through patches and they are not
> > > > really trivial to follow with many incremental steps so the higher level
> > > > intention is lost easily.
> > > > 
> > > > Do I get it right that the default semantic is essentially
> > > > 	- allocate page from the given nodemask (with __GFP_RETRY_MAYFAIL
> > > > 	  semantic)
> > > > 	- fallback to numa unrestricted allocation with the default
> > > > 	  numa policy on the failure
> > > > 
> > > > Or are there any usecases to modify how hard to keep the preference over
> > > > the fallback?
> > > 
> > > tl;dr is: yes, and no usecases.
> > 
> > OK, then I am wondering why the change has to be so involved. Except for
> > syscall plumbing the only real change to the allocator path would be
> > something like
> > 
> > static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> > {
> > 	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> > 	if (unlikely(policy->mode == MPOL_BIND || 
> > 	   	     policy->mode == MPOL_PREFERED_MANY) &&
> > 			apply_policy_zone(policy, gfp_zone(gfp)) &&
> > 			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
> > 		return &policy->v.nodes;
> > 
> > 	return NULL;
> > }
> > 
> > alloc_pages_current
> > 
> > 	if (pol->mode == MPOL_INTERLEAVE)
> > 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
> > 	else {
> > 		gfp_t gfp_attempt = gfp;
> > 
> > 		/*
> > 		 * Make sure the first allocation attempt will try hard
> > 		 * but eventually fail without OOM killer or other
> > 		 * disruption before falling back to the full nodemask
> > 		 */
> > 		if (pol->mode == MPOL_PREFERED_MANY)
> > 			gfp_attempt |= __GFP_RETRY_MAYFAIL;	
> > 
> > 		page = __alloc_pages_nodemask(gfp_attempt, order,
> > 				policy_node(gfp, pol, numa_node_id()),
> > 				policy_nodemask(gfp, pol));
> > 		if (!page && pol->mode == MPOL_PREFERED_MANY)
> > 			page = __alloc_pages_nodemask(gfp, order,
> > 				numa_node_id(), NULL);
> > 	}
> > 
> > 	return page;
> > 
> > similar (well slightly more hairy) in alloc_pages_vma
> > 
> > Or do I miss something that really requires more involved approach like
> > building custom zonelists and other larger changes to the allocator?
> 
> Hi Michal,
> 
> I'm mostly done implementing this change. It looks good, and so far I think it's
> functionally equivalent. One thing though, above you use NULL for the fallback.
> That actually should not be NULL because of the logic in policy_node to restrict
> zones, and obey cpusets. I've implemented it as such, but I was hoping someone
> with a deeper understanding, and more experience can confirm that was the
> correct thing to do.

Cpusets are just plumbed into the allocator directly. Have a look at the
__cpuset_zone_allowed call inside get_page_from_freelist. Anyway,
functionally what you are looking for here is that the fallback
allocation behaves exactly as if there was no mempolicy in place, and
that is expressed by a NULL nodemask. The rest is done automagically...

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id()
  2020-06-19 16:23 Ben Widawsky
@ 2020-06-19 16:23 ` Ben Widawsky
  0 siblings, 0 replies; 44+ messages in thread
From: Ben Widawsky @ 2020-06-19 16:23 UTC (permalink / raw)
  To: linux-mm

Calling out some distinctions first as I understand it, and the
reasoning of the patch:
numa_node_id() - The node id for the currently running CPU.
numa_mem_id() - The node id for the closest memory node.

The case where they are not the same is CONFIG_HAVE_MEMORYLESS_NODES.
Only ia64 and powerpc support this option, so it is perhaps not a very
interesting situation to most.

The question is, when do you want which? numa_node_id() is definitely
what's desired if MPOL_PREFERRED or MPOL_LOCAL were used, since the ABI
states "This mode specifies "local allocation"; the memory is allocated
on the node of the CPU that triggered the allocation (the "local
node")." It would be weird, though not impossible, to set this policy on
a CPU that sits on a memoryless node. A more likely way to hit this is
with interleaving. The current interfaces will return some equally weird
thing, but at least it's symmetric. Therefore, in cases where the node
is being queried for the currently running process, it probably makes
sense to use numa_node_id(). For other cases, however, when the CPU is
trying to obtain "local" memory, numa_mem_id() already accounts for this
and should be used instead.

This really should only affect configurations where
CONFIG_HAVE_MEMORYLESS_NODES=y, and even on those machines it's quite
possible the ultimate behavior would be identical.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
---
 mm/mempolicy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 36ee3267c25f..99e0f3f9c4a6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1991,7 +1991,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 	int nid;
 
 	if (!nnodes)
-		return numa_node_id();
+		return numa_mem_id();
 	target = (unsigned int)n % nnodes;
 	nid = first_node(pol->v.nodes);
 	for (i = 0; i < target; i++)
@@ -2049,7 +2049,7 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 		nid = interleave_nid(*mpol, vma, addr,
 					huge_page_shift(hstate_vma(vma)));
 	} else {
-		nid = policy_node(gfp_flags, *mpol, numa_node_id());
+		nid = policy_node(gfp_flags, *mpol, numa_mem_id());
 		if ((*mpol)->mode == MPOL_BIND)
 			*nodemask = &(*mpol)->v.nodes;
 	}
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2020-06-29 10:16 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-19 16:24 [PATCH 00/18] multiple preferred nodes Ben Widawsky
2020-06-19 16:24 ` [PATCH 01/18] mm/mempolicy: Add comment for missing LOCAL Ben Widawsky
2020-06-24  7:55   ` Michal Hocko
2020-06-19 16:24 ` [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id() Ben Widawsky
2020-06-24  8:25   ` Michal Hocko
2020-06-24 16:48     ` Ben Widawsky
2020-06-26 12:30       ` Michal Hocko
2020-06-19 16:24 ` [PATCH 03/18] mm/page_alloc: start plumbing multi preferred node Ben Widawsky
2020-06-19 16:24 ` [PATCH 04/18] mm/page_alloc: add preferred pass to page allocation Ben Widawsky
2020-06-19 16:24 ` [PATCH 05/18] mm/mempolicy: convert single preferred_node to full nodemask Ben Widawsky
2020-06-19 16:24 ` [PATCH 06/18] mm/mempolicy: Add MPOL_PREFERRED_MANY for multiple preferred nodes Ben Widawsky
2020-06-19 16:24 ` [PATCH 07/18] mm/mempolicy: allow preferred code to take a nodemask Ben Widawsky
2020-06-19 16:24 ` [PATCH 08/18] mm/mempolicy: refactor rebind code for PREFERRED_MANY Ben Widawsky
2020-06-19 16:24 ` [PATCH 09/18] mm: Finish handling MPOL_PREFERRED_MANY Ben Widawsky
2020-06-19 16:24 ` [PATCH 10/18] mm: clean up alloc_pages_vma (thp) Ben Widawsky
2020-06-19 16:24 ` [PATCH 11/18] mm: Extract THP hugepage allocation Ben Widawsky
2020-06-19 16:24 ` [PATCH 12/18] mm/mempolicy: Use __alloc_page_node for interleaved Ben Widawsky
2020-06-19 16:24 ` [PATCH 13/18] mm: kill __alloc_pages Ben Widawsky
2020-06-19 16:24 ` [PATCH 14/18] mm/mempolicy: Introduce policy_preferred_nodes() Ben Widawsky
2020-06-19 16:24 ` [PATCH 15/18] mm: convert callers of __alloc_pages_nodemask to pmask Ben Widawsky
2020-06-19 16:24 ` [PATCH 16/18] alloc_pages_nodemask: turn preferred nid into a nodemask Ben Widawsky
2020-06-19 16:24 ` [PATCH 17/18] mm: Use less stack for page allocations Ben Widawsky
2020-06-19 16:24 ` [PATCH 18/18] mm/mempolicy: Advertise new MPOL_PREFERRED_MANY Ben Widawsky
2020-06-22  7:09 ` [PATCH 00/18] multiple preferred nodes Michal Hocko
2020-06-23 11:20   ` Michal Hocko
2020-06-23 16:12     ` Ben Widawsky
2020-06-24  7:52       ` Michal Hocko
2020-06-24 16:16         ` Ben Widawsky
2020-06-24 18:39           ` Michal Hocko
2020-06-24 19:37             ` Ben Widawsky
2020-06-24 19:51               ` Michal Hocko
2020-06-24 20:01                 ` Ben Widawsky
2020-06-24 20:07                   ` Michal Hocko
2020-06-24 20:23                     ` Ben Widawsky
2020-06-24 20:42                       ` Michal Hocko
2020-06-24 20:55                         ` Ben Widawsky
2020-06-25  6:28                           ` Michal Hocko
2020-06-26 21:39         ` Ben Widawsky
2020-06-29 10:16           ` Michal Hocko
2020-06-22 20:54 ` Andi Kleen
2020-06-22 21:02   ` Ben Widawsky
2020-06-22 21:07   ` Dave Hansen
2020-06-22 22:02     ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2020-06-19 16:23 Ben Widawsky
2020-06-19 16:23 ` [PATCH 02/18] mm/mempolicy: Use node_mem_id() instead of node_id() Ben Widawsky
