* [RFC v1 00/19] Modify zonelist to nodelist v1
@ 2019-11-21 15:17 Pengfei Li
  2019-11-21 15:17 ` [RFC v1 01/19] mm, mmzone: modify zonelist to nodelist Pengfei Li
                   ` (21 more replies)
  0 siblings, 22 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

Motivation
----------
Currently, if we want to iterate over all the nodes, we have to
traverse all the zones in the zonelist.
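
For example, the existing node-deduplication idiom (roughly what
dequeue_huge_page_nodemask() does before this series; a sketch, not an
exact quote, where zonelist, gfp_mask and nmask are the caller's
existing variables) looks like this:

	struct zoneref *z;
	struct zone *zone;
	int node = NUMA_NO_NODE;

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					gfp_zone(gfp_mask), nmask) {
		/* no need to ask again on the same node */
		if (zone_to_nid(zone) == node)
			continue;
		node = zone_to_nid(zone);
		/* ... act on 'node' once ... */
	}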

So, to reduce the number of loop iterations required to traverse the
nodes, this series of patches changes the zonelist into a nodelist.

Two new macros have been introduced:
1) for_each_node_nlist
2) for_each_node_nlist_nodemask
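
For illustration, the intended caller pattern looks roughly like this
(a minimal sketch; gfp_mask and nodemask stand in for whatever the
caller already has, and the exact definitions are in patch 01):

	struct nodelist *nodelist = node_nodelist(numa_node_id(), gfp_mask);
	struct nlist_traverser t;
	int node;

	/* one iteration per candidate node, not per (node, zone) pair */
	for_each_node_nlist_nodemask(node, &t, nodelist,
					gfp_zone(gfp_mask), nodemask) {
		/* ... act on 'node' ... */
	}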


Benefit
-------
1. For a NUMA system with N nodes, where each node has M zones, the
   number of loop iterations is reduced from N*M to N when traversing
   nodes.

2. The size of pg_data_t is significantly reduced (a rough sizing
   sketch follows).
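
   As a rough sizing sketch (assuming a 64-bit build with
   CONFIG_NODES_SHIFT=10, i.e. MAX_NUMNODES=1024, and MAX_NR_ZONES=4;
   the exact numbers depend on the config): struct zoneref and struct
   nlist_entry are both 16 bytes, so each old zonelist holds
   (MAX_NUMNODES * MAX_NR_ZONES + 1) * 16 bytes (about 64 KiB) of
   zonerefs, while each new nodelist holds (MAX_NUMNODES + 1) * 16
   bytes (about 16 KiB) of nlist_entries, i.e. roughly a
   MAX_NR_ZONES-fold reduction per list.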


Test Result
-----------
So far I have only run a simple page allocation benchmark on my
laptop; the results show that the performance of a single-node system
is almost unaffected.


Others
------
Next, I will do more performance testing and add the results to the
Test Result section.

Since I don't currently have a multi-node NUMA system, I would be
grateful if anyone is willing to test this series of patches.

I am still not sure this series of patches is heading in the right
direction, so I am sending it as an RFC.

Any comments are highly appreciated.

Pengfei Li (19):
  mm, mmzone: modify zonelist to nodelist
  mm, hugetlb: use for_each_node in dequeue_huge_page_nodemask()
  mm, oom_kill: use for_each_node in constrained_alloc()
  mm, slub: use for_each_node in get_any_partial()
  mm, slab: use for_each_node in fallback_alloc()
  mm, vmscan: use for_each_node in do_try_to_free_pages()
  mm, vmscan: use first_node in throttle_direct_reclaim()
  mm, vmscan: pass pgdat to wakeup_kswapd()
  mm, vmscan: use for_each_node in shrink_zones()
  mm, page_alloc: use for_each_node in wake_all_kswapds()
  mm, mempolicy: use first_node in mempolicy_slab_node()
  mm, mempolicy: use first_node in mpol_misplaced()
  mm, page_alloc: use first_node in local_memory_node()
  mm, compaction: rename compaction_zonelist_suitable
  mm, mm_init: rename mminit_verify_zonelist
  mm, page_alloc: cleanup build_zonelists
  mm, memory_hotplug: cleanup online_pages()
  kernel, sysctl: cleanup numa_zonelist_order
  mm, mmzone: cleanup zonelist in comments

 arch/hexagon/mm/init.c           |   2 +-
 arch/ia64/include/asm/topology.h |   2 +-
 arch/x86/mm/numa.c               |   2 +-
 drivers/tty/sysrq.c              |   2 +-
 include/linux/compaction.h       |   2 +-
 include/linux/gfp.h              |  18 +-
 include/linux/mmzone.h           | 249 +++++++++++++------------
 include/linux/oom.h              |   4 +-
 include/linux/swap.h             |   2 +-
 include/trace/events/oom.h       |   9 +-
 init/main.c                      |   2 +-
 kernel/cgroup/cpuset.c           |   4 +-
 kernel/sysctl.c                  |   8 +-
 mm/compaction.c                  |  20 +-
 mm/hugetlb.c                     |  21 +--
 mm/internal.h                    |  13 +-
 mm/memcontrol.c                  |   2 +-
 mm/memory_hotplug.c              |  24 +--
 mm/mempolicy.c                   |  26 ++-
 mm/mm_init.c                     |  74 +++++---
 mm/mmzone.c                      |  30 ++-
 mm/oom_kill.c                    |  16 +-
 mm/page_alloc.c                  | 301 ++++++++++++++++---------------
 mm/slab.c                        |  13 +-
 mm/slub.c                        |  14 +-
 mm/vmscan.c                      | 149 ++++++++-------
 26 files changed, 518 insertions(+), 491 deletions(-)

-- 
2.23.0



* [RFC v1 01/19] mm, mmzone: modify zonelist to nodelist
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
@ 2019-11-21 15:17 ` Pengfei Li
  2019-11-21 15:17 ` [RFC v1 02/19] mm, hugetlb: use for_each_node in dequeue_huge_page_nodemask() Pengfei Li
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

Motivation
----------
Currently, if we want to iterate over all the nodes, we have to
traverse all the zones in the zonelist.

So, to reduce the number of loop iterations required to traverse the
nodes, this series of patches changes the zonelist into a nodelist.

Two new macros have been introduced:
1) for_each_node_nlist
2) for_each_node_nlist_nodemask
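
As an illustration, the typical caller-side conversion this patch makes
for the zone iterators looks roughly like this (a sketch of the
pattern, not an exact hunk; zonelist/nodelist, highidx and nodemask are
the caller's existing variables):

	/* before: walk zones through zoneref cursors */
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist_nodemask(zone, z, zonelist, highidx, nodemask) {
		/* ... */
	}

	/* after: walk zones through an on-stack traverser */
	struct nlist_traverser t;
	struct zone *zone;

	for_each_zone_nlist_nodemask(zone, &t, nodelist, highidx, nodemask) {
		/* ... */
	}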

Benefit
-------
1. For a NUMA system with N nodes, where each node has M zones, the
   number of loop iterations is reduced from N*M to N when traversing
   nodes.

2. The size of pg_data_t is significantly reduced.

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 drivers/tty/sysrq.c        |   2 +-
 include/linux/gfp.h        |  10 +-
 include/linux/mmzone.h     | 239 ++++++++++++++++++++-----------------
 include/linux/oom.h        |   4 +-
 include/linux/swap.h       |   2 +-
 include/trace/events/oom.h |   9 +-
 mm/compaction.c            |  18 +--
 mm/hugetlb.c               |   8 +-
 mm/internal.h              |   7 +-
 mm/memcontrol.c            |   2 +-
 mm/mempolicy.c             |  18 +--
 mm/mm_init.c               |  70 ++++++-----
 mm/mmzone.c                |  30 ++---
 mm/oom_kill.c              |  10 +-
 mm/page_alloc.c            | 207 +++++++++++++++++---------------
 mm/slab.c                  |   8 +-
 mm/slub.c                  |   8 +-
 mm/vmscan.c                |  34 +++---
 18 files changed, 367 insertions(+), 319 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 573b2055173c..6c6fa8ba7397 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -362,7 +362,7 @@ static void moom_callback(struct work_struct *ignored)
 {
 	const gfp_t gfp_mask = GFP_KERNEL;
 	struct oom_control oc = {
-		.zonelist = node_zonelist(first_memory_node, gfp_mask),
+		.nodelist = node_nodelist(first_memory_node, gfp_mask),
 		.nodemask = NULL,
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..6caab5a30f39 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -456,13 +456,13 @@ static inline enum zone_type gfp_zone(gfp_t flags)
  * virtual kernel addresses to the allocated page(s).
  */
 
-static inline int gfp_zonelist(gfp_t flags)
+static inline int gfp_nodelist(gfp_t flags)
 {
 #ifdef CONFIG_NUMA
 	if (unlikely(flags & __GFP_THISNODE))
-		return ZONELIST_NOFALLBACK;
+		return NODELIST_NOFALLBACK;
 #endif
-	return ZONELIST_FALLBACK;
+	return NODELIST_FALLBACK;
 }
 
 /*
@@ -474,9 +474,9 @@ static inline int gfp_zonelist(gfp_t flags)
  * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
  * optimized to &contig_page_data at compile-time.
  */
-static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
+static inline struct nodelist *node_nodelist(int nid, gfp_t flags)
 {
-	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
+	return NODE_DATA(nid)->node_nodelists + gfp_nodelist(flags);
 }
 
 #ifndef HAVE_ARCH_FREE_PAGE
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 89d8ff06c9ce..dd493239b8b2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -644,46 +644,36 @@ static inline bool zone_intersects(struct zone *zone,
  */
 #define DEF_PRIORITY 12
 
-/* Maximum number of zones on a zonelist */
-#define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
-
 enum {
-	ZONELIST_FALLBACK,	/* zonelist with fallback */
+	NODELIST_FALLBACK,	/* nodelist with fallback */
 #ifdef CONFIG_NUMA
 	/*
-	 * The NUMA zonelists are doubled because we need zonelists that
+	 * The NUMA zonelists are doubled because we need nodelists that
 	 * restrict the allocations to a single node for __GFP_THISNODE.
 	 */
-	ZONELIST_NOFALLBACK,	/* zonelist without fallback (__GFP_THISNODE) */
+	NODELIST_NOFALLBACK,	/* nodelist without fallback (__GFP_THISNODE) */
 #endif
-	MAX_ZONELISTS
+	MAX_NODELISTS
 };
 
 /*
- * This struct contains information about a zone in a zonelist. It is stored
+ * This struct contains information about a node in a zonelist. It is stored
  * here to avoid dereferences into large structures and lookups of tables
  */
-struct zoneref {
-	struct zone *zone;	/* Pointer to actual zone */
-	int zone_idx;		/* zone_idx(zoneref->zone) */
+struct nlist_entry {
+	unsigned int node_usable_zones;
+	int nid;
+	struct zone *zones;
 };
 
-/*
- * One allocation request operates on a zonelist. A zonelist
- * is a list of zones, the first one is the 'goal' of the
- * allocation, the other zones are fallback zones, in decreasing
- * priority.
- *
- * To speed the reading of the zonelist, the zonerefs contain the zone index
- * of the entry being read. Helper functions to access information given
- * a struct zoneref are
- *
- * zonelist_zone()	- Return the struct zone * for an entry in _zonerefs
- * zonelist_zone_idx()	- Return the index of the zone for an entry
- * zonelist_node_idx()	- Return the index of the node for an entry
- */
-struct zonelist {
-	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
+struct nodelist {
+	struct nlist_entry _nlist_entries[MAX_NUMNODES + 1];
+};
+
+struct nlist_traverser {
+	const struct nlist_entry *entry;
+	unsigned int highidx_mask;
+	unsigned int usable_zones;
 };
 
 #ifndef CONFIG_DISCONTIGMEM
@@ -710,7 +700,7 @@ struct deferred_split {
 struct bootmem_data;
 typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
-	struct zonelist node_zonelists[MAX_ZONELISTS];
+	struct nodelist node_nodelists[MAX_NODELISTS];
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
@@ -1018,105 +1008,138 @@ extern struct zone *next_zone(struct zone *zone);
 			; /* do nothing */		\
 		else
 
-static inline struct zone *zonelist_zone(struct zoneref *zoneref)
+#define usable_zones_add(ztype, usable_zones)			\
+	__set_bit(ztype, (unsigned long *)&usable_zones)
+
+#define usable_zones_remove(ztype, usable_zones)		\
+	__clear_bit(ztype, (unsigned long *)&usable_zones)
+
+#define usable_zones_highest(usable_zones)	\
+	__fls(usable_zones)
+
+void __next_nlist_entry_nodemask(struct nlist_traverser *t,
+					nodemask_t *nodemask);
+
+static inline int
+traverser_node(struct nlist_traverser *t)
 {
-	return zoneref->zone;
+	return t->entry->nid;
 }
 
-static inline int zonelist_zone_idx(struct zoneref *zoneref)
+static inline void
+init_nlist_traverser(struct nlist_traverser *t,
+			struct nodelist *nlist, enum zone_type highidx)
 {
-	return zoneref->zone_idx;
+	t->entry = nlist->_nlist_entries;
+	t->highidx_mask = (~(UINT_MAX << (highidx + 1)));
 }
 
-static inline int zonelist_node_idx(struct zoneref *zoneref)
+static inline unsigned int
+node_has_usable_zones(struct nlist_traverser *t)
 {
-	return zone_to_nid(zoneref->zone);
+	t->usable_zones = t->entry->node_usable_zones & t->highidx_mask;
+	return t->usable_zones;
 }
 
-struct zoneref *__next_zones_zonelist(struct zoneref *z,
-					enum zone_type highest_zoneidx,
-					nodemask_t *nodes);
+static __always_inline void
+next_nlist_entry(struct nlist_traverser *t)
+{
+	while (!node_has_usable_zones(t))
+		t->entry++;
+}
 
-/**
- * next_zones_zonelist - Returns the next zone at or below highest_zoneidx within the allowed nodemask using a cursor within a zonelist as a starting point
- * @z - The cursor used as a starting point for the search
- * @highest_zoneidx - The zone index of the highest zone to return
- * @nodes - An optional nodemask to filter the zonelist with
- *
- * This function returns the next zone at or below a given zone index that is
- * within the allowed nodemask using a cursor as the starting point for the
- * search. The zoneref returned is a cursor that represents the current zone
- * being examined. It should be advanced by one before calling
- * next_zones_zonelist again.
- */
-static __always_inline struct zoneref *next_zones_zonelist(struct zoneref *z,
-					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+static __always_inline void
+next_nlist_entry_nodemask(struct nlist_traverser *t, nodemask_t *nodemask)
 {
-	if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
-		return z;
-	return __next_zones_zonelist(z, highest_zoneidx, nodes);
+	if (likely(!nodemask))
+		next_nlist_entry(t);
+	else
+		__next_nlist_entry_nodemask(t, nodemask);
 }
 
-/**
- * first_zones_zonelist - Returns the first zone at or below highest_zoneidx within the allowed nodemask in a zonelist
- * @zonelist - The zonelist to search for a suitable zone
- * @highest_zoneidx - The zone index of the highest zone to return
- * @nodes - An optional nodemask to filter the zonelist with
- * @return - Zoneref pointer for the first suitable zone found (see below)
- *
- * This function returns the first zone at or below a given zone index that is
- * within the allowed nodemask. The zoneref returned is a cursor that can be
- * used to iterate the zonelist with next_zones_zonelist by advancing it by
- * one before calling.
- *
- * When no eligible zone is found, zoneref->zone is NULL (zoneref itself is
- * never NULL). This may happen either genuinely, or due to concurrent nodemask
- * update due to cpuset modification.
- */
-static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
-					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+#define for_each_node_nlist(node, t, nlist, highidx)		\
+	for (init_nlist_traverser(t, nlist, highidx),		\
+		next_nlist_entry(t), node = traverser_node(t);	\
+	     node != NUMA_NO_NODE;				\
+	     (t)->entry++,					\
+		next_nlist_entry(t), node = traverser_node(t))
+
+#define for_each_node_nlist_nodemask(node, t, nlist, highidx, nodemask) \
+	for (init_nlist_traverser(t, nlist, highidx),			\
+		next_nlist_entry_nodemask(t, nodemask),			\
+		node = traverser_node(t);				\
+	     node != NUMA_NO_NODE;					\
+	     (t)->entry++, next_nlist_entry_nodemask(t, nodemask),	\
+		node = traverser_node(t))
+
+static inline int
+first_node_nlist_nodemask(struct nodelist *nlist,
+				enum zone_type highidx, nodemask_t *nodemask)
 {
-	return next_zones_zonelist(zonelist->_zonerefs,
-							highest_zoneidx, nodes);
+	struct nlist_traverser t;
+
+	init_nlist_traverser(&t, nlist, highidx);
+	next_nlist_entry_nodemask(&t, nodemask);
+
+	return traverser_node(&t);
 }
 
-/**
- * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
- * @zone - The current zone in the iterator
- * @z - The current pointer within zonelist->_zonerefs being iterated
- * @zlist - The zonelist being iterated
- * @highidx - The zone index of the highest zone to return
- * @nodemask - Nodemask allowed by the allocator
- *
- * This iterator iterates though all zones at or below a given zone index and
- * within a given nodemask
- */
-#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
-	for (z = first_zones_zonelist(zlist, highidx, nodemask), zone = zonelist_zone(z);	\
-		zone;							\
-		z = next_zones_zonelist(++z, highidx, nodemask),	\
-			zone = zonelist_zone(z))
+static inline struct zone *
+traverser_zone(struct nlist_traverser *t)
+{
+	enum zone_type ztype = usable_zones_highest(t->usable_zones);
 
-#define for_next_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
-	for (zone = z->zone;	\
-		zone;							\
-		z = next_zones_zonelist(++z, highidx, nodemask),	\
-			zone = zonelist_zone(z))
+	usable_zones_remove(ztype, t->usable_zones);
 
+	return (t->entry->zones + ztype);
+}
 
-/**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
- * @zone - The current zone in the iterator
- * @z - The current pointer within zonelist->zones being iterated
- * @zlist - The zonelist being iterated
- * @highidx - The zone index of the highest zone to return
- *
- * This iterator iterates though all zones at or below a given zone index.
- */
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+static inline struct zone *
+get_zone_nlist(struct nlist_traverser *t)
+{
+	next_nlist_entry(t);
+
+	if (!t->entry->zones)
+		return NULL;
+
+	return traverser_zone(t);
+}
+
+static inline struct zone *
+get_zone_nlist_nodemask(struct nlist_traverser *t, nodemask_t *nodemask)
+{
+	next_nlist_entry_nodemask(t, nodemask);
+
+	if (!t->entry->zones)
+		return NULL;
+
+	return traverser_zone(t);
+}
+
+#define for_each_zone_nlist(zone, t, nlist, highidx)			\
+	for (init_nlist_traverser(t, nlist, highidx),			\
+		zone = get_zone_nlist(t);				\
+	     zone;							\
+	     zone = ((t)->usable_zones ? traverser_zone(t)		\
+				: ((t)->entry++, get_zone_nlist(t))))
+
+#define for_each_zone_nlist_nodemask(zone, t, nlist, highidx, nodemask)	\
+	for (init_nlist_traverser(t, nlist, highidx),				\
+		zone = get_zone_nlist_nodemask(t, nodemask);			\
+	     zone;								\
+	     zone = ((t)->usable_zones ? traverser_zone(t)			\
+			: ((t)->entry++, get_zone_nlist_nodemask(t, nodemask))))
+
+static inline struct zone *
+first_zone_nlist_nodemask(struct nodelist *nlist,
+				enum zone_type highidx, nodemask_t *nodemask)
+{
+	struct nlist_traverser t;
+
+	init_nlist_traverser(&t, nlist, highidx);
+
+	return get_zone_nlist_nodemask(&t, nodemask);
+}
 
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c265f019..634b1d2a81fa 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -10,7 +10,7 @@
 #include <linux/sched/coredump.h> /* MMF_* */
 #include <linux/mm.h> /* VM_FAULT* */
 
-struct zonelist;
+struct nodelist;
 struct notifier_block;
 struct mem_cgroup;
 struct task_struct;
@@ -28,7 +28,7 @@ enum oom_constraint {
  */
 struct oom_control {
 	/* Used to determine cpuset */
-	struct zonelist *zonelist;
+	struct nodelist *nodelist;
 
 	/* Used to determine mempolicy */
 	nodemask_t *nodemask;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e99f7ac1d7e..c041d7478ec8 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -349,7 +349,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
 
 /* linux/mm/vmscan.c */
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
-extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+extern unsigned long try_to_free_pages(struct nodelist *nodelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
diff --git a/include/trace/events/oom.h b/include/trace/events/oom.h
index 26a11e4a2c36..08007ea34243 100644
--- a/include/trace/events/oom.h
+++ b/include/trace/events/oom.h
@@ -31,7 +31,8 @@ TRACE_EVENT(oom_score_adj_update,
 
 TRACE_EVENT(reclaim_retry_zone,
 
-	TP_PROTO(struct zoneref *zoneref,
+	TP_PROTO(int node,
+		struct zone *zone,
 		int order,
 		unsigned long reclaimable,
 		unsigned long available,
@@ -39,7 +40,7 @@ TRACE_EVENT(reclaim_retry_zone,
 		int no_progress_loops,
 		bool wmark_check),
 
-	TP_ARGS(zoneref, order, reclaimable, available, min_wmark, no_progress_loops, wmark_check),
+	TP_ARGS(node, zone, order, reclaimable, available, min_wmark, no_progress_loops, wmark_check),
 
 	TP_STRUCT__entry(
 		__field(	int, node)
@@ -53,8 +54,8 @@ TRACE_EVENT(reclaim_retry_zone,
 	),
 
 	TP_fast_assign(
-		__entry->node = zone_to_nid(zoneref->zone);
-		__entry->zone_idx = zoneref->zone_idx;
+		__entry->node = node;
+		__entry->zone_idx = zone_idx(zone);
 		__entry->order = order;
 		__entry->reclaimable = reclaimable;
 		__entry->available = available;
diff --git a/mm/compaction.c b/mm/compaction.c
index 672d3c78c6ab..d9f42e18991c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2039,15 +2039,15 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
 bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		int alloc_flags)
 {
+	struct nlist_traverser t;
 	struct zone *zone;
-	struct zoneref *z;
 
 	/*
 	 * Make sure at least one zone would pass __compaction_suitable if we continue
 	 * retrying the reclaim.
 	 */
-	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
-					ac->nodemask) {
+	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist,
+					ac->high_zoneidx, ac->nodemask) {
 		unsigned long available;
 		enum compact_result compact_result;
 
@@ -2060,7 +2060,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 		compact_result = __compaction_suitable(zone, order, alloc_flags,
-				ac_classzone_idx(ac), available);
+				ac->classzone_idx, available);
 		if (compact_result != COMPACT_SKIPPED)
 			return true;
 	}
@@ -2341,9 +2341,9 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		enum compact_priority prio, struct page **capture)
 {
 	int may_perform_io = gfp_mask & __GFP_IO;
-	struct zoneref *z;
-	struct zone *zone;
 	enum compact_result rc = COMPACT_SKIPPED;
+	struct nlist_traverser t;
+	struct zone *zone;
 
 	/*
 	 * Check if the GFP flags allow compaction - GFP_NOIO is really
@@ -2355,8 +2355,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
 
 	/* Compact each zone in the list */
-	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
-								ac->nodemask) {
+	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist,
+					ac->high_zoneidx, ac->nodemask) {
 		enum compact_result status;
 
 		if (prio > MIN_COMPACT_PRIORITY
@@ -2366,7 +2366,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		}
 
 		status = compact_zone_order(zone, order, gfp_mask, prio,
-				alloc_flags, ac_classzone_idx(ac), capture);
+				alloc_flags, ac->classzone_idx, capture);
 		rc = max(status, rc);
 
 		/* The allocation should succeed, stop compacting */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..2e55ec5dc84d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -848,16 +848,16 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 		nodemask_t *nmask)
 {
 	unsigned int cpuset_mems_cookie;
-	struct zonelist *zonelist;
+	struct nodelist *nodelist;
+	struct nlist_traverser t;
 	struct zone *zone;
-	struct zoneref *z;
 	int node = NUMA_NO_NODE;
 
-	zonelist = node_zonelist(nid, gfp_mask);
+	nodelist = node_nodelist(nid, gfp_mask);
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, gfp_zone(gfp_mask), nmask) {
+	for_each_zone_nlist_nodemask(zone, &t, nodelist, gfp_zone(gfp_mask), nmask) {
 		struct page *page;
 
 		if (!cpuset_zone_allowed(zone, gfp_mask))
diff --git a/mm/internal.h b/mm/internal.h
index 3cf20ab3ca01..90008f9fe7d9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -111,16 +111,15 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
  * by a const pointer.
  */
 struct alloc_context {
-	struct zonelist *zonelist;
+	struct nodelist *nodelist;
 	nodemask_t *nodemask;
-	struct zoneref *preferred_zoneref;
+	struct zone *preferred_zone;
+	enum zone_type classzone_idx;
 	int migratetype;
 	enum zone_type high_zoneidx;
 	bool spread_dirty_pages;
 };
 
-#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
-
 /*
  * Locate the struct page for both the matching buddy in our
  * pair (buddy1) and the combined O(n+1) page they form (page).
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c5b5f74cfd4d..a9c3464c2bff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1558,7 +1558,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				     int order)
 {
 	struct oom_control oc = {
-		.zonelist = NULL,
+		.nodelist = NULL,
 		.nodemask = NULL,
 		.memcg = memcg,
 		.gfp_mask = gfp_mask,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b2920ae87a61..b1df19d42047 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1876,18 +1876,18 @@ unsigned int mempolicy_slab_node(void)
 		return interleave_nodes(policy);
 
 	case MPOL_BIND: {
-		struct zoneref *z;
+		struct zone *zone;
 
 		/*
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		struct zonelist *zonelist;
+		struct nodelist *nodelist;
 		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
-		zonelist = &NODE_DATA(node)->node_zonelists[ZONELIST_FALLBACK];
-		z = first_zones_zonelist(zonelist, highest_zoneidx,
+		nodelist = &NODE_DATA(node)->node_nodelists[NODELIST_FALLBACK];
+		zone = first_zone_nlist_nodemask(nodelist, highest_zoneidx,
 							&policy->v.nodes);
-		return z->zone ? zone_to_nid(z->zone) : node;
+		return zone ? zone_to_nid(zone) : node;
 	}
 
 	default:
@@ -2404,7 +2404,7 @@ static void sp_free(struct sp_node *n)
 int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol;
-	struct zoneref *z;
+	struct zone *zone;
 	int curnid = page_to_nid(page);
 	unsigned long pgoff;
 	int thiscpu = raw_smp_processor_id();
@@ -2440,11 +2440,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 */
 		if (node_isset(curnid, pol->v.nodes))
 			goto out;
-		z = first_zones_zonelist(
-				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+		zone = first_zone_nlist_nodemask(
+				node_nodelist(numa_node_id(), GFP_HIGHUSER),
 				gfp_zone(GFP_HIGHUSER),
 				&pol->v.nodes);
-		polnid = zone_to_nid(z->zone);
+		polnid = zone_to_nid(zone);
 		break;
 
 	default:
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5c918388de99..448e3228a911 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -18,10 +18,46 @@
 #ifdef CONFIG_DEBUG_MEMORY_INIT
 int __meminitdata mminit_loglevel;
 
+const char *nodelist_name[MAX_NODELISTS] __meminitconst = {
+	"general",
+#ifdef CONFIG_NUMA
+	"thisnode"
+#endif
+};
+
 #ifndef SECTIONS_SHIFT
 #define SECTIONS_SHIFT	0
 #endif
 
+void __init mminit_print_nodelist(int nid, int nl_type)
+{
+	pg_data_t *pgdat = NODE_DATA(nid);
+	struct nodelist *nodelist;
+	struct nlist_traverser t;
+	struct zone *node_zones, *zone;
+	enum zone_type ztype;
+
+	nodelist = &pgdat->node_nodelists[nl_type];
+	node_zones = pgdat->node_zones;
+
+	for (ztype = MAX_NR_ZONES - 1; ztype >= 0; ztype--) {
+		zone = node_zones + ztype;
+
+		if (!populated_zone(zone))
+			continue;
+
+		/* Print information about the nodelist */
+		printk(KERN_DEBUG "mminit::nodelist %s %d:%s = ",
+			nodelist_name[nl_type], nid, zone->name);
+
+		/* Iterate the nodelist */
+		for_each_zone_nlist(zone, &t, nodelist, ztype) {
+			pr_cont("%d:%s ", traverser_node(&t), zone->name);
+		}
+		pr_cont("\n");
+	}
+}
+
 /* The zonelists are simply reported, validation is manual. */
 void __init mminit_verify_zonelist(void)
 {
@@ -31,33 +67,13 @@ void __init mminit_verify_zonelist(void)
 		return;
 
 	for_each_online_node(nid) {
-		pg_data_t *pgdat = NODE_DATA(nid);
-		struct zone *zone;
-		struct zoneref *z;
-		struct zonelist *zonelist;
-		int i, listid, zoneid;
-
-		BUG_ON(MAX_ZONELISTS > 2);
-		for (i = 0; i < MAX_ZONELISTS * MAX_NR_ZONES; i++) {
-
-			/* Identify the zone and nodelist */
-			zoneid = i % MAX_NR_ZONES;
-			listid = i / MAX_NR_ZONES;
-			zonelist = &pgdat->node_zonelists[listid];
-			zone = &pgdat->node_zones[zoneid];
-			if (!populated_zone(zone))
-				continue;
-
-			/* Print information about the zonelist */
-			printk(KERN_DEBUG "mminit::zonelist %s %d:%s = ",
-				listid > 0 ? "thisnode" : "general", nid,
-				zone->name);
-
-			/* Iterate the zonelist */
-			for_each_zone_zonelist(zone, z, zonelist, zoneid)
-				pr_cont("%d:%s ", zone_to_nid(zone), zone->name);
-			pr_cont("\n");
-		}
+		/* print general nodelist */
+		mminit_print_nodelist(nid, NODELIST_FALLBACK);
+
+#ifdef CONFIG_NUMA
+		/* print thisnode nodelist */
+		mminit_print_nodelist(nid, NODELIST_NOFALLBACK);
+#endif
 	}
 }
 
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..1f14e979108a 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -43,33 +43,27 @@ struct zone *next_zone(struct zone *zone)
 	return zone;
 }
 
-static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+static inline bool node_in_nodemask(int nid, nodemask_t *nodemask)
 {
 #ifdef CONFIG_NUMA
-	return node_isset(zonelist_node_idx(zref), *nodes);
+	return node_isset(nid, *nodemask);
 #else
-	return 1;
+	return true;
 #endif /* CONFIG_NUMA */
 }
 
-/* Returns the next zone at or below highest_zoneidx in a zonelist */
-struct zoneref *__next_zones_zonelist(struct zoneref *z,
-					enum zone_type highest_zoneidx,
-					nodemask_t *nodes)
+/* Returns the next nlist_entry at or below highest_zoneidx in a nodelist */
+void __next_nlist_entry_nodemask(struct nlist_traverser *t,
+					nodemask_t *nodemask)
 {
 	/*
-	 * Find the next suitable zone to use for the allocation.
-	 * Only filter based on nodemask if it's set
+	 * Find the next suitable nlist_entry to use for the allocation.
+	 * Only filter based on nodemask
 	 */
-	if (unlikely(nodes == NULL))
-		while (zonelist_zone_idx(z) > highest_zoneidx)
-			z++;
-	else
-		while (zonelist_zone_idx(z) > highest_zoneidx ||
-				(z->zone && !zref_in_nodemask(z, nodes)))
-			z++;
-
-	return z;
+	while (!node_has_usable_zones(t) ||
+		(traverser_node(t) != NUMA_NO_NODE &&
+			!node_in_nodemask(traverser_node(t), nodemask)))
+		t->entry++;
 }
 
 #ifdef CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 314ce1a3cf25..f44c79db0cd6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -251,10 +251,10 @@ static const char * const oom_constraint_text[] = {
  */
 static enum oom_constraint constrained_alloc(struct oom_control *oc)
 {
-	struct zone *zone;
-	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(oc->gfp_mask);
 	bool cpuset_limited = false;
+	struct nlist_traverser t;
+	struct zone *zone;
 	int nid;
 
 	if (is_memcg_oom(oc)) {
@@ -268,7 +268,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	if (!IS_ENABLED(CONFIG_NUMA))
 		return CONSTRAINT_NONE;
 
-	if (!oc->zonelist)
+	if (!oc->nodelist)
 		return CONSTRAINT_NONE;
 	/*
 	 * Reach here only when __GFP_NOFAIL is used. So, we should avoid
@@ -292,7 +292,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
-	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
+	for_each_zone_nlist_nodemask(zone, &t, oc->nodelist,
 			high_zoneidx, oc->nodemask)
 		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
 			cpuset_limited = true;
@@ -1120,7 +1120,7 @@ bool out_of_memory(struct oom_control *oc)
 void pagefault_out_of_memory(void)
 {
 	struct oom_control oc = {
-		.zonelist = NULL,
+		.nodelist = NULL,
 		.nodemask = NULL,
 		.memcg = NULL,
 		.gfp_mask = 0,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62dcd6b76c80..ec5f48b755ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2555,16 +2555,15 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 						bool force)
 {
-	struct zonelist *zonelist = ac->zonelist;
 	unsigned long flags;
-	struct zoneref *z;
+	struct nlist_traverser t;
 	struct zone *zone;
 	struct page *page;
 	int order;
 	bool ret;
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
-								ac->nodemask) {
+	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist,
+					ac->high_zoneidx, ac->nodemask) {
 		/*
 		 * Preserve at least one pageblock unless memory pressure
 		 * is really high.
@@ -3580,9 +3579,9 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						const struct alloc_context *ac)
 {
-	struct zoneref *z;
-	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	struct nlist_traverser t;
+	struct zone *zone;
 	bool no_fallback;
 
 retry:
@@ -3591,9 +3590,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
-	z = ac->preferred_zoneref;
-	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
-								ac->nodemask) {
+	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist,
+					ac->high_zoneidx, ac->nodemask) {
 		struct page *page;
 		unsigned long mark;
 
@@ -3631,7 +3629,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 
 		if (no_fallback && nr_online_nodes > 1 &&
-		    zone != ac->preferred_zoneref->zone) {
+		    zone != ac->preferred_zone) {
 			int local_nid;
 
 			/*
@@ -3639,7 +3637,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			 * fragmenting fallbacks. Locality is more important
 			 * than fragmentation avoidance.
 			 */
-			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
+			local_nid = zone_to_nid(ac->preferred_zone);
 			if (zone_to_nid(zone) != local_nid) {
 				alloc_flags &= ~ALLOC_NOFRAGMENT;
 				goto retry;
@@ -3648,7 +3646,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
 		if (!zone_watermark_fast(zone, order, mark,
-				       ac_classzone_idx(ac), alloc_flags)) {
+					ac->classzone_idx, alloc_flags)) {
 			int ret;
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -3667,7 +3665,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				goto try_this_zone;
 
 			if (node_reclaim_mode == 0 ||
-			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
+			    !zone_allows_reclaim(ac->preferred_zone, zone))
 				continue;
 
 			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
@@ -3681,7 +3679,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			default:
 				/* did we reclaim enough */
 				if (zone_watermark_ok(zone, order, mark,
-						ac_classzone_idx(ac), alloc_flags))
+						ac->classzone_idx, alloc_flags))
 					goto try_this_zone;
 
 				continue;
@@ -3689,7 +3687,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		}
 
 try_this_zone:
-		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
+		page = rmqueue(ac->preferred_zone, zone, order,
 				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
 			prep_new_page(page, order, gfp_mask, alloc_flags);
@@ -3792,7 +3790,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
 	struct oom_control oc = {
-		.zonelist = ac->zonelist,
+		.nodelist = ac->nodelist,
 		.nodemask = ac->nodemask,
 		.memcg = NULL,
 		.gfp_mask = gfp_mask,
@@ -4031,8 +4029,8 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 		     enum compact_priority *compact_priority,
 		     int *compaction_retries)
 {
+	struct nlist_traverser t;
 	struct zone *zone;
-	struct zoneref *z;
 
 	if (!order || order > PAGE_ALLOC_COSTLY_ORDER)
 		return false;
@@ -4043,10 +4041,10 @@ should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_fla
 	 * Let's give them a good hope and keep retrying while the order-0
 	 * watermarks are OK.
 	 */
-	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
-					ac->nodemask) {
+	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist,
+					ac->high_zoneidx, ac->nodemask) {
 		if (zone_watermark_ok(zone, 0, min_wmark_pages(zone),
-					ac_classzone_idx(ac), alloc_flags))
+					ac->classzone_idx, alloc_flags))
 			return true;
 	}
 	return false;
@@ -4121,7 +4119,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 	fs_reclaim_acquire(gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();
 
-	progress = try_to_free_pages(ac->zonelist, order, gfp_mask,
+	progress = try_to_free_pages(ac->nodelist, order, gfp_mask,
 								ac->nodemask);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
@@ -4167,12 +4165,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 			     const struct alloc_context *ac)
 {
-	struct zoneref *z;
-	struct zone *zone;
-	pg_data_t *last_pgdat = NULL;
 	enum zone_type high_zoneidx = ac->high_zoneidx;
+	pg_data_t *last_pgdat = NULL;
+	struct nlist_traverser t;
+	struct zone *zone;
 
-	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, high_zoneidx,
+	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist, high_zoneidx,
 					ac->nodemask) {
 		if (last_pgdat != zone->zone_pgdat)
 			wakeup_kswapd(zone, gfp_mask, order, high_zoneidx);
@@ -4263,6 +4261,16 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!__gfp_pfmemalloc_flags(gfp_mask);
 }
 
+static inline void
+reset_ac_preferred_zone(struct alloc_context *ac)
+{
+	ac->preferred_zone = first_zone_nlist_nodemask(ac->nodelist,
+					ac->high_zoneidx, ac->nodemask);
+
+	if (ac->preferred_zone)
+		ac->classzone_idx = zone_idx(ac->preferred_zone);
+}
+
 /*
  * Checks whether it makes sense to retry the reclaim to make a forward progress
  * for the given allocation request.
@@ -4278,8 +4286,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		     struct alloc_context *ac, int alloc_flags,
 		     bool did_some_progress, int *no_progress_loops)
 {
+	struct nlist_traverser t;
 	struct zone *zone;
-	struct zoneref *z;
 	bool ret = false;
 
 	/*
@@ -4307,8 +4315,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	 * request even if all reclaimable pages are considered then we are
 	 * screwed and have to go OOM.
 	 */
-	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
-					ac->nodemask) {
+	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist,
+					ac->high_zoneidx, ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
 		unsigned long min_wmark = min_wmark_pages(zone);
@@ -4322,9 +4330,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		 * reclaimable pages?
 		 */
 		wmark = __zone_watermark_ok(zone, order, min_wmark,
-				ac_classzone_idx(ac), alloc_flags, available);
-		trace_reclaim_retry_zone(z, order, reclaimable,
-				available, min_wmark, *no_progress_loops, wmark);
+				ac->classzone_idx, alloc_flags, available);
+		trace_reclaim_retry_zone(traverser_node(&t), zone, order,
+					reclaimable, available, min_wmark,
+					*no_progress_loops, wmark);
 		if (wmark) {
 			/*
 			 * If we didn't make any progress and have a lot of
@@ -4440,9 +4449,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * there was a cpuset modification and we are retrying - otherwise we
 	 * could end up iterating over non-eligible zones endlessly.
 	 */
-	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
-					ac->high_zoneidx, ac->nodemask);
-	if (!ac->preferred_zoneref->zone)
+	reset_ac_preferred_zone(ac);
+	if (!ac->preferred_zone)
 		goto nopage;
 
 	if (alloc_flags & ALLOC_KSWAPD)
@@ -4527,8 +4535,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
 		ac->nodemask = NULL;
-		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
-					ac->high_zoneidx, ac->nodemask);
+		reset_ac_preferred_zone(ac);
 	}
 
 	/* Attempt with potentially adjusted zonelist and alloc_flags */
@@ -4663,7 +4670,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		unsigned int *alloc_flags)
 {
 	ac->high_zoneidx = gfp_zone(gfp_mask);
-	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
+	ac->nodelist = node_nodelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
 	ac->migratetype = gfpflags_to_migratetype(gfp_mask);
 
@@ -4700,8 +4707,7 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
 	 * also used as the starting point for the zonelist iterator. It
 	 * may get reset for allocations that ignore memory policies.
 	 */
-	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
-					ac->high_zoneidx, ac->nodemask);
+	reset_ac_preferred_zone(ac);
 }
 
 /*
@@ -4736,7 +4742,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
 	 */
-	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zone, gfp_mask);
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
@@ -5031,15 +5037,15 @@ EXPORT_SYMBOL(free_pages_exact);
  */
 static unsigned long nr_free_zone_pages(int offset)
 {
-	struct zoneref *z;
+	struct nlist_traverser t;
 	struct zone *zone;
 
 	/* Just pick one node, since fallback list is circular */
 	unsigned long sum = 0;
 
-	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+	struct nodelist *nodelist = node_nodelist(numa_node_id(), GFP_KERNEL);
 
-	for_each_zone_zonelist(zone, z, zonelist, offset) {
+	for_each_zone_nlist(zone, &t, nodelist, offset) {
 		unsigned long size = zone_managed_pages(zone);
 		unsigned long high = high_wmark_pages(zone);
 		if (size > high)
@@ -5425,33 +5431,53 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	show_swap_cache_info();
 }
 
-static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
-{
-	zoneref->zone = zone;
-	zoneref->zone_idx = zone_idx(zone);
-}
-
 /*
  * Builds allocation fallback zone lists.
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
+static struct nlist_entry *
+set_nlist_entry(struct nlist_entry *entry, int nid)
 {
-	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
-	int nr_zones = 0;
+	struct zone *node_zones = NODE_DATA(nid)->node_zones;
+	struct zone *zone;
+
+	BUILD_BUG_ON(MAX_NR_ZONES >=
+			BITS_PER_BYTE * sizeof(entry->node_usable_zones));
+
+	entry->node_usable_zones = 0;
 
 	do {
 		zone_type--;
-		zone = pgdat->node_zones + zone_type;
+		zone = node_zones + zone_type;
 		if (managed_zone(zone)) {
-			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
+			usable_zones_add(zone_type,
+					entry->node_usable_zones);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
 
-	return nr_zones;
+	if (entry->node_usable_zones) {
+		entry->nid = nid;
+		entry->zones = node_zones;
+		entry++;
+	}
+
+	return entry;
+}
+
+/*
+ * _nlist_entries[] end with
+ * (1) ->node_usable_zones = UINT_MAX,
+ * (2) ->nid = NUMA_NO_NODE,
+ * (3) ->zones = NULL
+ */
+static inline void set_last_nlist_entry(struct nlist_entry *entry)
+{
+	entry->node_usable_zones = UINT_MAX;
+	entry->nid = NUMA_NO_NODE;
+	entry->zones = NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -5568,45 +5594,37 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 	return best_node;
 }
 
-
 /*
  * Build zonelists ordered by node and zones within node.
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+static void build_nodelists_in_node_order(pg_data_t *pgdat, int *node_order,
 		unsigned nr_nodes)
 {
-	struct zoneref *zonerefs;
+	struct nlist_entry *entry;
 	int i;
 
-	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
-
-	for (i = 0; i < nr_nodes; i++) {
-		int nr_zones;
+	entry = pgdat->node_nodelists[NODELIST_FALLBACK]._nlist_entries;
 
-		pg_data_t *node = NODE_DATA(node_order[i]);
+	for (i = 0; i < nr_nodes; i++)
+		entry = set_nlist_entry(entry, node_order[i]);
 
-		nr_zones = build_zonerefs_node(node, zonerefs);
-		zonerefs += nr_zones;
-	}
-	zonerefs->zone = NULL;
-	zonerefs->zone_idx = 0;
+	set_last_nlist_entry(entry);
 }
 
 /*
  * Build gfp_thisnode zonelists
  */
-static void build_thisnode_zonelists(pg_data_t *pgdat)
+static void build_thisnode_nodelists(pg_data_t *pgdat)
 {
-	struct zoneref *zonerefs;
-	int nr_zones;
+	struct nlist_entry *entry;
+
+	entry = pgdat->node_nodelists[NODELIST_NOFALLBACK]._nlist_entries;
 
-	zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
-	nr_zones = build_zonerefs_node(pgdat, zonerefs);
-	zonerefs += nr_zones;
-	zonerefs->zone = NULL;
-	zonerefs->zone_idx = 0;
+	entry = set_nlist_entry(entry, pgdat->node_id);
+
+	set_last_nlist_entry(entry);
 }
 
 /*
@@ -5645,8 +5663,8 @@ static void build_zonelists(pg_data_t *pgdat)
 		load--;
 	}
 
-	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
-	build_thisnode_zonelists(pgdat);
+	build_nodelists_in_node_order(pgdat, node_order, nr_nodes);
+	build_thisnode_nodelists(pgdat);
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
@@ -5658,12 +5676,12 @@ static void build_zonelists(pg_data_t *pgdat)
  */
 int local_memory_node(int node)
 {
-	struct zoneref *z;
+	struct zone *zone;
 
-	z = first_zones_zonelist(node_zonelist(node, GFP_KERNEL),
+	zone = first_zone_nlist_nodemask(node_nodelist(node, GFP_KERNEL),
 				   gfp_zone(GFP_KERNEL),
 				   NULL);
-	return zone_to_nid(z->zone);
+	return zone_to_nid(zone);
 }
 #endif
 
@@ -5674,14 +5692,12 @@ static void setup_min_slab_ratio(void);
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	struct zoneref *zonerefs;
-	int nr_zones;
+	struct nlist_entry *entry;
 
 	local_node = pgdat->node_id;
 
-	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
-	nr_zones = build_zonerefs_node(pgdat, zonerefs);
-	zonerefs += nr_zones;
+	entry = pgdat->node_nodelists[NODELIST_FALLBACK]._nlist_entries;
+	entry = set_nlist_entry(entry, local_node);
 
 	/*
 	 * Now we build the zonelist so that it contains the zones
@@ -5694,18 +5710,17 @@ static void build_zonelists(pg_data_t *pgdat)
 	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
 		if (!node_online(node))
 			continue;
-		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
-		zonerefs += nr_zones;
+
+		entry = set_nlist_entry(entry, node);
 	}
 	for (node = 0; node < local_node; node++) {
 		if (!node_online(node))
 			continue;
-		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
-		zonerefs += nr_zones;
+
+		entry = set_nlist_entry(entry, node);
 	}
 
-	zonerefs->zone = NULL;
-	zonerefs->zone_idx = 0;
+	set_last_nlist_entry(entry);
 }
 
 #endif	/* CONFIG_NUMA */
@@ -8563,12 +8578,12 @@ struct page *alloc_contig_pages(unsigned long nr_pages, gfp_t gfp_mask,
 				int nid, nodemask_t *nodemask)
 {
 	unsigned long ret, pfn, flags;
-	struct zonelist *zonelist;
+	struct nodelist *nodelist;
+	struct nlist_traverser t;
 	struct zone *zone;
-	struct zoneref *z;
 
-	zonelist = node_zonelist(nid, gfp_mask);
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+	nodelist = node_nodelist(nid, gfp_mask);
+	for_each_zone_nlist_nodemask(zone, &t, nodelist,
 					gfp_zone(gfp_mask), nodemask) {
 		spin_lock_irqsave(&zone->lock, flags);
 
diff --git a/mm/slab.c b/mm/slab.c
index f1e1840af533..b9a1353cf2ab 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3103,8 +3103,8 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
  */
 static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 {
-	struct zonelist *zonelist;
-	struct zoneref *z;
+	struct nodelist *nodelist;
+	struct nlist_traverser t;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
@@ -3117,14 +3117,14 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	zonelist = node_zonelist(mempolicy_slab_node(), flags);
+	nodelist = node_nodelist(mempolicy_slab_node(), flags);
 
 retry:
 	/*
 	 * Look through allowed nodes for objects available
 	 * from existing per node queues.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_nlist(zone, &t, nodelist, high_zoneidx) {
 		nid = zone_to_nid(zone);
 
 		if (cpuset_zone_allowed(zone, flags) &&
diff --git a/mm/slub.c b/mm/slub.c
index b7c6b2e7f2db..ad1abfbc57b1 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1905,8 +1905,8 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		struct kmem_cache_cpu *c)
 {
 #ifdef CONFIG_NUMA
-	struct zonelist *zonelist;
-	struct zoneref *z;
+	struct nodelist *nodelist;
+	struct nlist_traverser t;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *object;
@@ -1936,8 +1936,8 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
-		zonelist = node_zonelist(mempolicy_slab_node(), flags);
-		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		nodelist = node_nodelist(mempolicy_slab_node(), flags);
+		for_each_zone_nlist(zone, &t, nodelist, high_zoneidx) {
 			struct kmem_cache_node *n;
 
 			n = get_node(s, zone_to_nid(zone));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c8e88f4d9932..a3ad433c8ff4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2918,9 +2918,9 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct nodelist *nodelist, struct scan_control *sc)
 {
-	struct zoneref *z;
+	struct nlist_traverser t;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
@@ -2938,7 +2938,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
 	}
 
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+	for_each_zone_nlist_nodemask(zone, &t, nodelist,
 					sc->reclaim_idx, sc->nodemask) {
 		/*
 		 * Take care memory controller reclaiming has small influence
@@ -3029,12 +3029,12 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
  * returns:	0, if no pages reclaimed
  * 		else, the number of pages reclaimed
  */
-static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
+static unsigned long do_try_to_free_pages(struct nodelist *nodelist,
 					  struct scan_control *sc)
 {
 	int initial_priority = sc->priority;
 	pg_data_t *last_pgdat;
-	struct zoneref *z;
+	struct nlist_traverser t;
 	struct zone *zone;
 retry:
 	delayacct_freepages_start();
@@ -3046,7 +3046,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		shrink_zones(zonelist, sc);
+		shrink_zones(nodelist, sc);
 
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			break;
@@ -3063,7 +3063,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	} while (--sc->priority >= 0);
 
 	last_pgdat = NULL;
-	for_each_zone_zonelist_nodemask(zone, z, zonelist, sc->reclaim_idx,
+	for_each_zone_nlist_nodemask(zone, &t, nodelist, sc->reclaim_idx,
 					sc->nodemask) {
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
@@ -3166,10 +3166,10 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
  * Returns true if a fatal signal was delivered during throttling. If this
  * happens, the page allocator should not consider triggering the OOM killer.
  */
-static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+static bool throttle_direct_reclaim(gfp_t gfp_mask, struct nodelist *nodelist,
 					nodemask_t *nodemask)
 {
-	struct zoneref *z;
+	struct nlist_traverser t;
 	struct zone *zone;
 	pg_data_t *pgdat = NULL;
 
@@ -3204,7 +3204,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	 * for remote pfmemalloc reserves and processes on different nodes
 	 * should make reasonable progress.
 	 */
-	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+	for_each_zone_nlist_nodemask(zone, &t, nodelist,
 					gfp_zone(gfp_mask), nodemask) {
 		if (zone_idx(zone) > ZONE_NORMAL)
 			continue;
@@ -3250,7 +3250,7 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	return false;
 }
 
-unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+unsigned long try_to_free_pages(struct nodelist *nodelist, int order,
 				gfp_t gfp_mask, nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
@@ -3279,13 +3279,13 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	 * 1 is returned so that the page allocator does not OOM kill at this
 	 * point.
 	 */
-	if (throttle_direct_reclaim(sc.gfp_mask, zonelist, nodemask))
+	if (throttle_direct_reclaim(sc.gfp_mask, nodelist, nodemask))
 		return 1;
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	trace_mm_vmscan_direct_reclaim_begin(order, sc.gfp_mask);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	nr_reclaimed = do_try_to_free_pages(nodelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 	set_task_reclaim_state(current, NULL);
@@ -3363,7 +3363,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	 * equal pressure on all the nodes. This is based on the assumption that
 	 * the reclaim does not bail out early.
 	 */
-	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
+	struct nodelist *nodelist = node_nodelist(numa_node_id(), sc.gfp_mask);
 
 	set_task_reclaim_state(current, &sc.reclaim_state);
 
@@ -3374,7 +3374,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	psi_memstall_enter(&pflags);
 	noreclaim_flag = memalloc_noreclaim_save();
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	nr_reclaimed = do_try_to_free_pages(nodelist, &sc);
 
 	memalloc_noreclaim_restore(noreclaim_flag);
 	psi_memstall_leave(&pflags);
@@ -4033,7 +4033,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 		.may_swap = 1,
 		.hibernation_mode = 1,
 	};
-	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
+	struct nodelist *nodelist = node_nodelist(numa_node_id(), sc.gfp_mask);
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
 
@@ -4041,7 +4041,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	noreclaim_flag = memalloc_noreclaim_save();
 	set_task_reclaim_state(current, &sc.reclaim_state);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+	nr_reclaimed = do_try_to_free_pages(nodelist, &sc);
 
 	set_task_reclaim_state(current, NULL);
 	memalloc_noreclaim_restore(noreclaim_flag);
-- 
2.23.0



* [RFC v1 02/19] mm, hugetlb: use for_each_node in dequeue_huge_page_nodemask()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
  2019-11-21 15:17 ` [RFC v1 01/19] mm, mmzone: modify zonelist to nodelist Pengfei Li
@ 2019-11-21 15:17 ` Pengfei Li
  2019-11-21 15:17 ` [RFC v1 03/19] mm, oom_kill: use for_each_node in constrained_alloc() Pengfei Li
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In dequeue_huge_page_nodemask(), we want to traverse nodes instead of
zones, so use for_each_node_nlist_nodemask() instead of
for_each_zone_nlist_nodemask().

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/hugetlb.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2e55ec5dc84d..287b90c7ab36 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -850,25 +850,18 @@ static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask,
 	unsigned int cpuset_mems_cookie;
 	struct nodelist *nodelist;
 	struct nlist_traverser t;
-	struct zone *zone;
-	int node = NUMA_NO_NODE;
+	int node;
 
 	nodelist = node_nodelist(nid, gfp_mask);
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
-	for_each_zone_nlist_nodemask(zone, &t, nodelist, gfp_zone(gfp_mask), nmask) {
+	for_each_node_nlist_nodemask(node, &t, nodelist,
+					gfp_zone(gfp_mask), nmask) {
 		struct page *page;
 
-		if (!cpuset_zone_allowed(zone, gfp_mask))
-			continue;
-		/*
-		 * no need to ask again on the same node. Pool is node rather than
-		 * zone aware
-		 */
-		if (zone_to_nid(zone) == node)
+		if (!cpuset_node_allowed(node, gfp_mask))
 			continue;
-		node = zone_to_nid(zone);
 
 		page = dequeue_huge_page_node_exact(h, node);
 		if (page)
-- 
2.23.0


* [RFC v1 03/19] mm, oom_kill: use for_each_node in constrained_alloc()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
  2019-11-21 15:17 ` [RFC v1 01/19] mm, mmzone: modify zonelist to nodelist Pengfei Li
  2019-11-21 15:17 ` [RFC v1 02/19] mm, hugetlb: use for_each_node in dequeue_huge_page_nodemask() Pengfei Li
@ 2019-11-21 15:17 ` Pengfei Li
  2019-11-21 15:17 ` [RFC v1 04/19] mm, slub: use for_each_node in get_any_partial() Pengfei Li
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In constrained_alloc(), we want to traverse nodes instead of zones, so
use for_each_node_nlist_nodemask() instead of
for_each_zone_nlist_nodemask(), and stop the walk as soon as a
disallowed node is found.

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/oom_kill.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f44c79db0cd6..db509d5e4db3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -254,7 +254,6 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	enum zone_type high_zoneidx = gfp_zone(oc->gfp_mask);
 	bool cpuset_limited = false;
 	struct nlist_traverser t;
-	struct zone *zone;
 	int nid;
 
 	if (is_memcg_oom(oc)) {
@@ -292,10 +291,13 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	}
 
 	/* Check this allocation failure is caused by cpuset's wall function */
-	for_each_zone_nlist_nodemask(zone, &t, oc->nodelist,
-			high_zoneidx, oc->nodemask)
-		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
+	for_each_node_nlist_nodemask(nid, &t, oc->nodelist,
+					high_zoneidx, oc->nodemask) {
+		if (!cpuset_node_allowed(nid, oc->gfp_mask)) {
 			cpuset_limited = true;
+			break;
+		}
+	}
 
 	if (cpuset_limited) {
 		oc->totalpages = total_swap_pages;
-- 
2.23.0


* [RFC v1 04/19] mm, slub: use for_each_node in get_any_partial()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (2 preceding siblings ...)
  2019-11-21 15:17 ` [RFC v1 03/19] mm, oom_kill: use for_each_node in constrained_alloc() Pengfei Li
@ 2019-11-21 15:17 ` Pengfei Li
  2019-11-21 15:17 ` [RFC v1 05/19] mm, slab: use for_each_node in fallback_alloc() Pengfei Li
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In get_any_partial(), we want to traverse nodes instead of zones, so
use for_each_node_nlist() instead of for_each_zone_nlist().

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/slub.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ad1abfbc57b1..2fe3edbcf296 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1905,10 +1905,10 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 		struct kmem_cache_cpu *c)
 {
 #ifdef CONFIG_NUMA
+	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct nodelist *nodelist;
 	struct nlist_traverser t;
-	struct zone *zone;
-	enum zone_type high_zoneidx = gfp_zone(flags);
+	int node;
 	void *object;
 	unsigned int cpuset_mems_cookie;
 
@@ -1937,12 +1937,12 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 	do {
 		cpuset_mems_cookie = read_mems_allowed_begin();
 		nodelist = node_nodelist(mempolicy_slab_node(), flags);
-		for_each_zone_nlist(zone, &t, nodelist, high_zoneidx) {
+		for_each_node_nlist(node, &t, nodelist, high_zoneidx) {
 			struct kmem_cache_node *n;
 
-			n = get_node(s, zone_to_nid(zone));
+			n = get_node(s, node);
 
-			if (n && cpuset_zone_allowed(zone, flags) &&
+			if (n && cpuset_node_allowed(node, flags) &&
 					n->nr_partial > s->min_partial) {
 				object = get_partial_node(s, n, c, flags);
 				if (object) {
-- 
2.23.0


* [RFC v1 05/19] mm, slab: use for_each_node in fallback_alloc()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (3 preceding siblings ...)
  2019-11-21 15:17 ` [RFC v1 04/19] mm, slub: use for_each_node in get_any_partial() Pengfei Li
@ 2019-11-21 15:17 ` Pengfei Li
  2019-11-21 15:17 ` [RFC v1 06/19] mm, vmscan: use for_each_node in do_try_to_free_pages() Pengfei Li
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In fallback_alloc(), we want to traverse nodes instead of zones, so
use for_each_node_nlist() instead of for_each_zone_nlist().

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/slab.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b9a1353cf2ab..b94c06934459 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3103,12 +3103,11 @@ static void *alternate_node_alloc(struct kmem_cache *cachep, gfp_t flags)
  */
 static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 {
-	struct nodelist *nodelist;
-	struct nlist_traverser t;
-	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	struct page *page;
+	struct nodelist *nodelist;
+	struct nlist_traverser t;
 	int nid;
 	unsigned int cpuset_mems_cookie;
 
@@ -3124,10 +3123,8 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 	 * Look through allowed nodes for objects available
 	 * from existing per node queues.
 	 */
-	for_each_zone_nlist(zone, &t, nodelist, high_zoneidx) {
-		nid = zone_to_nid(zone);
-
-		if (cpuset_zone_allowed(zone, flags) &&
+	for_each_node_nlist(nid, &t, nodelist, high_zoneidx) {
+		if (cpuset_node_allowed(nid, flags) &&
 			get_node(cache, nid) &&
 			get_node(cache, nid)->free_objects) {
 				obj = ____cache_alloc_node(cache,
-- 
2.23.0


* [RFC v1 06/19] mm, vmscan: use for_each_node in do_try_to_free_pages()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (4 preceding siblings ...)
  2019-11-21 15:17 ` [RFC v1 05/19] mm, slab: use for_each_node in fallback_alloc() Pengfei Li
@ 2019-11-21 15:17 ` Pengfei Li
  2019-11-21 15:17 ` [RFC v1 07/19] mm, vmscan: use first_node in throttle_direct_reclaim() Pengfei Li
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In do_try_to_free_pages(), we want to traverse nodes instead of zones,
so use for_each_node_nlist_nodemask() instead of
for_each_zone_nlist_nodemask().

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/vmscan.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a3ad433c8ff4..159a2aaa8db1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3033,9 +3033,9 @@ static unsigned long do_try_to_free_pages(struct nodelist *nodelist,
 					  struct scan_control *sc)
 {
 	int initial_priority = sc->priority;
-	pg_data_t *last_pgdat;
+	pg_data_t *pgdat;
 	struct nlist_traverser t;
-	struct zone *zone;
+	int node;
 retry:
 	delayacct_freepages_start();
 
@@ -3062,20 +3062,17 @@ static unsigned long do_try_to_free_pages(struct nodelist *nodelist,
 			sc->may_writepage = 1;
 	} while (--sc->priority >= 0);
 
-	last_pgdat = NULL;
-	for_each_zone_nlist_nodemask(zone, &t, nodelist, sc->reclaim_idx,
-					sc->nodemask) {
-		if (zone->zone_pgdat == last_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
+	for_each_node_nlist_nodemask(node, &t, nodelist,
+					sc->reclaim_idx, sc->nodemask) {
+		pgdat = NODE_DATA(node);
 
-		snapshot_refaults(sc->target_mem_cgroup, zone->zone_pgdat);
+		snapshot_refaults(sc->target_mem_cgroup, pgdat);
 
 		if (cgroup_reclaim(sc)) {
 			struct lruvec *lruvec;
 
 			lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup,
-						   zone->zone_pgdat);
+						   pgdat);
 			clear_bit(LRUVEC_CONGESTED, &lruvec->flags);
 		}
 	}
-- 
2.23.0


* [RFC v1 07/19] mm, vmscan: use first_node in throttle_direct_reclaim()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (5 preceding siblings ...)
  2019-11-21 15:17 ` [RFC v1 06/19] mm, vmscan: use for_each_node in do_try_to_free_pages() Pengfei Li
@ 2019-11-21 15:17 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 08/19] mm, vmscan: pass pgdat to wakeup_kswapd() Pengfei Li
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:17 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In throttle_direct_reclaim(), we want the first usable node rather
than the first zone, so look it up directly with
first_node_nlist_nodemask() instead of open-coding the search with
for_each_zone_nlist_nodemask().
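
With the helper from patch 01 the search reduces to a direct lookup,
roughly as sketched here (high_idx and nodemask as set up in the hunk
below; this is a sketch, not additional code):

	node = first_node_nlist_nodemask(nodelist, high_idx, nodemask);
	if (node == NUMA_NO_NODE)
		return false;	/* nothing usable, do not throttle */

	pgdat = NODE_DATA(node);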

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/vmscan.c | 41 +++++++++++++++++++----------------------
 1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 159a2aaa8db1..7554c8ba0841 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3166,9 +3166,9 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
 static bool throttle_direct_reclaim(gfp_t gfp_mask, struct nodelist *nodelist,
 					nodemask_t *nodemask)
 {
-	struct nlist_traverser t;
-	struct zone *zone;
-	pg_data_t *pgdat = NULL;
+	pg_data_t *pgdat;
+	enum zone_type high_idx;
+	int node;
 
 	/*
 	 * Kernel threads should not be throttled as they may be indirectly
@@ -3178,21 +3178,24 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct nodelist *nodelist,
 	 * processes to block on log_wait_commit().
 	 */
 	if (current->flags & PF_KTHREAD)
-		goto out;
+		return false;
 
 	/*
 	 * If a fatal signal is pending, this process should not throttle.
 	 * It should return quickly so it can exit and free its memory
 	 */
 	if (fatal_signal_pending(current))
-		goto out;
+		return false;
 
 	/*
 	 * Check if the pfmemalloc reserves are ok by finding the first node
 	 * with a usable ZONE_NORMAL or lower zone. The expectation is that
 	 * GFP_KERNEL will be required for allocating network buffers when
 	 * swapping over the network so ZONE_HIGHMEM is unusable.
-	 *
+	 */
+	high_idx = min_t(enum zone_type, ZONE_NORMAL, gfp_zone(gfp_mask));
+
+	/*
 	 * Throttling is based on the first usable node and throttled processes
 	 * wait on a queue until kswapd makes progress and wakes them. There
 	 * is an affinity then between processes waking up and where reclaim
@@ -3201,21 +3204,16 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct nodelist *nodelist,
 	 * for remote pfmemalloc reserves and processes on different nodes
 	 * should make reasonable progress.
 	 */
-	for_each_zone_nlist_nodemask(zone, &t, nodelist,
-					gfp_zone(gfp_mask), nodemask) {
-		if (zone_idx(zone) > ZONE_NORMAL)
-			continue;
-
-		/* Throttle based on the first usable node */
-		pgdat = zone->zone_pgdat;
-		if (allow_direct_reclaim(pgdat))
-			goto out;
-		break;
-	}
+	node = first_node_nlist_nodemask(nodelist, high_idx, nodemask);
 
 	/* If no zone was usable by the allocation flags then do not throttle */
-	if (!pgdat)
-		goto out;
+	if (node == NUMA_NO_NODE)
+		return false;
+
+	pgdat = NODE_DATA(node);
+	/* Throttle based on the first usable node */
+	if (allow_direct_reclaim(pgdat))
+		return false;
 
 	/* Account for the throttling */
 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
@@ -3236,14 +3234,13 @@ static bool throttle_direct_reclaim(gfp_t gfp_mask, struct nodelist *nodelist,
 	}
 
 	/* Throttle until kswapd wakes the process */
-	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
-		allow_direct_reclaim(pgdat));
+	wait_event_killable(pgdat->pfmemalloc_wait,
+				allow_direct_reclaim(pgdat));
 
 check_pending:
 	if (fatal_signal_pending(current))
 		return true;
 
-out:
 	return false;
 }
 
-- 
2.23.0


* [RFC v1 08/19] mm, vmscan: pass pgdat to wakeup_kswapd()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (6 preceding siblings ...)
  2019-11-21 15:17 ` [RFC v1 07/19] mm, vmscan: use first_node in throttle_direct_reclaim() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 09/19] mm, vmscan: use for_each_node in shrink_zones() Pengfei Li
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

This is a preparation patch. Pass pgdat to wakeup_kswapd() directly
and avoid the indirect access to pgdat via zone->zone_pgdat.
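
For reference, the two call forms used in this patch and in patch 10
look like:

	/* zone-based caller, e.g. rmqueue() */
	wakeup_kswapd(zone->zone_pgdat, 0, 0, zone_idx(zone));

	/* node-based caller, e.g. wake_all_kswapds() after patch 10 */
	wakeup_kswapd(NODE_DATA(node), gfp_mask, order, high_zoneidx);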

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 include/linux/mmzone.h |  2 +-
 mm/page_alloc.c        |  4 ++--
 mm/vmscan.c            | 12 ++++--------
 3 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dd493239b8b2..599b30620aa1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -814,7 +814,7 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
 #include <linux/memory_hotplug.h>
 
 void build_all_zonelists(pg_data_t *pgdat);
-void wakeup_kswapd(struct zone *zone, gfp_t gfp_mask, int order,
+void wakeup_kswapd(pg_data_t *pgdat, gfp_t gfp_mask, int order,
 		   enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			 int classzone_idx, unsigned int alloc_flags,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ec5f48b755ff..2dcf2a21c578 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3306,7 +3306,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 	/* Separate test+clear to avoid unnecessary atomics */
 	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
 		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
-		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+		wakeup_kswapd(zone->zone_pgdat, 0, 0, zone_idx(zone));
 	}
 
 	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
@@ -4173,7 +4173,7 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist, high_zoneidx,
 					ac->nodemask) {
 		if (last_pgdat != zone->zone_pgdat)
-			wakeup_kswapd(zone, gfp_mask, order, high_zoneidx);
+			wakeup_kswapd(zone->zone_pgdat, gfp_mask, order, high_zoneidx);
 		last_pgdat = zone->zone_pgdat;
 	}
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7554c8ba0841..b5256ef682c2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3964,17 +3964,13 @@ static int kswapd(void *p)
  * has failed or is not needed, still wake up kcompactd if only compaction is
  * needed.
  */
-void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
+void wakeup_kswapd(pg_data_t *pgdat, gfp_t gfp_flags, int order,
 		   enum zone_type classzone_idx)
 {
-	pg_data_t *pgdat;
-
-	if (!managed_zone(zone))
-		return;
+	int node = pgdat->node_id;
 
-	if (!cpuset_zone_allowed(zone, gfp_flags))
+	if (!cpuset_node_allowed(node, gfp_flags))
 		return;
-	pgdat = zone->zone_pgdat;
 
 	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
 		pgdat->kswapd_classzone_idx = classzone_idx;
@@ -4001,7 +3997,7 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 		return;
 	}
 
-	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, classzone_idx, order,
+	trace_mm_vmscan_wakeup_kswapd(node, classzone_idx, order,
 				      gfp_flags);
 	wake_up_interruptible(&pgdat->kswapd_wait);
 }
-- 
2.23.0


* [RFC v1 09/19] mm, vmscan: use for_each_node in shrink_zones()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (7 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 08/19] mm, vmscan: pass pgdat to wakeup_kswapd() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 10/19] mm, page_alloc: use for_each_node in wake_all_kswapds() Pengfei Li
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In shrink_zones(), we want to traverse nodes instead of zones, so use
for_each_node_nlist_nodemask() instead of
for_each_zone_nlist_nodemask().
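
Since the iterator now returns nodes, the per-zone compaction check is
done by walking the usable zones of the current node through the
traverser; the new node_compaction_ready() below boils down to,
roughly:

	do {
		zone = traverser_zone(&t);
		/* compaction_ready(zone, sc) decides per zone */
	} while (t.usable_zones);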

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/vmscan.c | 53 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 30 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b5256ef682c2..2b0e51525c3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2910,6 +2910,25 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 }
 
+static bool
+node_compaction_ready(struct nlist_traverser *t, struct scan_control *sc)
+{
+	bool node_ready = true;
+	struct zone *zone;
+
+	do {
+		zone = traverser_zone(t);
+
+		if (compaction_ready(zone, sc))
+			sc->compaction_ready = true;
+		else
+			node_ready = false;
+
+	} while (t->usable_zones);
+
+	return node_ready;
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -2920,12 +2939,12 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  */
 static void shrink_zones(struct nodelist *nodelist, struct scan_control *sc)
 {
-	struct nlist_traverser t;
-	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	gfp_t orig_mask;
-	pg_data_t *last_pgdat = NULL;
+	pg_data_t *pgdat;
+	struct nlist_traverser t;
+	int node;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2938,14 +2957,17 @@ static void shrink_zones(struct nodelist *nodelist, struct scan_control *sc)
 		sc->reclaim_idx = gfp_zone(sc->gfp_mask);
 	}
 
-	for_each_zone_nlist_nodemask(zone, &t, nodelist,
+	for_each_node_nlist_nodemask(node, &t, nodelist,
 					sc->reclaim_idx, sc->nodemask) {
+
+		pgdat = NODE_DATA(node);
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
 		if (!cgroup_reclaim(sc)) {
-			if (!cpuset_zone_allowed(zone,
+			if (!cpuset_node_allowed(node,
 						 GFP_KERNEL | __GFP_HARDWALL))
 				continue;
 
@@ -2960,18 +2982,7 @@ static void shrink_zones(struct nodelist *nodelist, struct scan_control *sc)
 			 */
 			if (IS_ENABLED(CONFIG_COMPACTION) &&
 			    sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-			    compaction_ready(zone, sc)) {
-				sc->compaction_ready = true;
-				continue;
-			}
-
-			/*
-			 * Shrink each node in the zonelist once. If the
-			 * zonelist is ordered by zone (not the default) then a
-			 * node may be shrunk multiple times but in that case
-			 * the user prefers lower zones being preserved.
-			 */
-			if (zone->zone_pgdat == last_pgdat)
+			    node_compaction_ready(&t, sc))
 				continue;
 
 			/*
@@ -2981,7 +2992,7 @@ static void shrink_zones(struct nodelist *nodelist, struct scan_control *sc)
 			 * and balancing, not for a memcg's limit.
 			 */
 			nr_soft_scanned = 0;
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
+			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat,
 						sc->order, sc->gfp_mask,
 						&nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
@@ -2989,11 +3000,7 @@ static void shrink_zones(struct nodelist *nodelist, struct scan_control *sc)
 			/* need some check for avoid more shrink_zone() */
 		}
 
-		/* See comment about same check for global reclaim above */
-		if (zone->zone_pgdat == last_pgdat)
-			continue;
-		last_pgdat = zone->zone_pgdat;
-		shrink_node(zone->zone_pgdat, sc);
+		shrink_node(pgdat, sc);
 	}
 
 	/*
-- 
2.23.0


* [RFC v1 10/19] mm, page_alloc: use for_each_node in wake_all_kswapds()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (8 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 09/19] mm, vmscan: use for_each_node in shrink_zones() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 11/19] mm, mempolicy: use first_node in mempolicy_slab_node() Pengfei Li
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In wake_all_kswapds(), we want to traverse nodes instead of zones, so
use for_each_node_nlist_nodemask() instead of
for_each_zone_nlist_nodemask().

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/page_alloc.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2dcf2a21c578..aa5c2ef4f8ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4166,15 +4166,12 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
 			     const struct alloc_context *ac)
 {
 	enum zone_type high_zoneidx = ac->high_zoneidx;
-	pg_data_t *last_pgdat = NULL;
 	struct nlist_traverser t;
-	struct zone *zone;
+	int node;
 
-	for_each_zone_nlist_nodemask(zone, &t, ac->nodelist, high_zoneidx,
-					ac->nodemask) {
-		if (last_pgdat != zone->zone_pgdat)
-			wakeup_kswapd(zone->zone_pgdat, gfp_mask, order, high_zoneidx);
-		last_pgdat = zone->zone_pgdat;
+	for_each_node_nlist_nodemask(node, &t, ac->nodelist,
+					high_zoneidx, ac->nodemask) {
+		wakeup_kswapd(NODE_DATA(node), gfp_mask, order, high_zoneidx);
 	}
 }
 
-- 
2.23.0


* [RFC v1 11/19] mm, mempolicy: use first_node in mempolicy_slab_node()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (9 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 10/19] mm, page_alloc: use for_each_node in wake_all_kswapds() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 12/19] mm, mempolicy: use first_node in mpol_misplaced() Pengfei Li
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In mempolicy_slab_node(), we want the first node instead of the first
zone, so use first_node_nlist_nodemask() instead of
first_zone_nlist_nodemask().
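
Note that first_node_nlist_nodemask() returns NUMA_NO_NODE when no
node in the nodelist matches the nodemask, so the fallback to the
local node is kept, as the hunk below shows.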

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/mempolicy.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b1df19d42047..66184add1627 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1855,8 +1855,10 @@ static unsigned interleave_nodes(struct mempolicy *policy)
  */
 unsigned int mempolicy_slab_node(void)
 {
-	struct mempolicy *policy;
 	int node = numa_mem_id();
+	struct mempolicy *policy;
+	struct nodelist *nodelist;
+	int first_node;
 
 	if (in_interrupt())
 		return node;
@@ -1876,18 +1878,16 @@ unsigned int mempolicy_slab_node(void)
 		return interleave_nodes(policy);
 
 	case MPOL_BIND: {
-		struct zone *zone;
-
 		/*
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		struct nodelist *nodelist;
-		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
-		nodelist = &NODE_DATA(node)->node_nodelists[NODELIST_FALLBACK];
-		zone = first_zone_nlist_nodemask(nodelist, highest_zoneidx,
-							&policy->v.nodes);
-		return zone ? zone_to_nid(zone) : node;
+		nodelist = node_nodelist(node, GFP_KERNEL);
+
+		first_node = first_node_nlist_nodemask(nodelist,
+					gfp_zone(GFP_KERNEL), &policy->v.nodes);
+
+		return (first_node != NUMA_NO_NODE) ? first_node : node;
 	}
 
 	default:
-- 
2.23.0


* [RFC v1 12/19] mm, mempolicy: use first_node in mpol_misplaced()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (10 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 11/19] mm, mempolicy: use first_node in mempolicy_slab_node() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 13/19] mm, page_alloc: use first_node in local_memory_node() Pengfei Li
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In mpol_misplaced(), we want the first node instead of the first zone,
so use first_node_nlist_nodemask() instead of
first_zone_nlist_nodemask().

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/mempolicy.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66184add1627..8fd962762e46 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2404,7 +2404,6 @@ static void sp_free(struct sp_node *n)
 int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
 {
 	struct mempolicy *pol;
-	struct zone *zone;
 	int curnid = page_to_nid(page);
 	unsigned long pgoff;
 	int thiscpu = raw_smp_processor_id();
@@ -2440,11 +2439,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 */
 		if (node_isset(curnid, pol->v.nodes))
 			goto out;
-		zone = first_zone_nlist_nodemask(
+		polnid = first_node_nlist_nodemask(
 				node_nodelist(numa_node_id(), GFP_HIGHUSER),
 				gfp_zone(GFP_HIGHUSER),
 				&pol->v.nodes);
-		polnid = zone_to_nid(zone);
 		break;
 
 	default:
-- 
2.23.0


* [RFC v1 13/19] mm, page_alloc: use first_node in local_memory_node()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (11 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 12/19] mm, mempolicy: use first_node in mpol_misplaced() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 14/19] mm, compaction: rename compaction_zonelist_suitable Pengfei Li
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

In local_memory_node(), we want the first node instead of the first
zone, so use first_node_nlist_nodemask() instead of
first_zone_nlist_nodemask().

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/page_alloc.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa5c2ef4f8ec..5c96d1ecd643 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,12 +5673,8 @@ static void build_zonelists(pg_data_t *pgdat)
  */
 int local_memory_node(int node)
 {
-	struct zone *zone;
-
-	zone = first_zone_nlist_nodemask(node_nodelist(node, GFP_KERNEL),
-				   gfp_zone(GFP_KERNEL),
-				   NULL);
-	return zone_to_nid(zone);
+	return first_node_nlist_nodemask(node_nodelist(node, GFP_KERNEL),
+						gfp_zone(GFP_KERNEL), NULL);
 }
 #endif
 
-- 
2.23.0


* [RFC v1 14/19] mm, compaction: rename compaction_zonelist_suitable
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (12 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 13/19] mm, page_alloc: use first_node in local_memory_node() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 15/19] mm, mm_init: rename mminit_verify_zonelist Pengfei Li
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

This is a cleanup patch. Rename compaction_zonelist_suitable
to compaction_nodelist_suitable.

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 include/linux/compaction.h | 2 +-
 mm/compaction.c            | 2 +-
 mm/page_alloc.c            | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4b898cdbdf05..3ba55eb7c353 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -177,7 +177,7 @@ static inline bool compaction_withdrawn(enum compact_result result)
 }
 
 
-bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+bool compaction_nodelist_suitable(struct alloc_context *ac, int order,
 					int alloc_flags);
 
 extern int kcompactd_run(int nid);
diff --git a/mm/compaction.c b/mm/compaction.c
index d9f42e18991c..91581ab1d593 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2036,7 +2036,7 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
 	return ret;
 }
 
-bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+bool compaction_nodelist_suitable(struct alloc_context *ac, int order,
 		int alloc_flags)
 {
 	struct nlist_traverser t;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c96d1ecd643..3987b8e97158 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3967,7 +3967,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	 * to work with, so we retry only if it looks like reclaim can help.
 	 */
 	if (compaction_needs_reclaim(compact_result)) {
-		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
+		ret = compaction_nodelist_suitable(ac, order, alloc_flags);
 		goto out;
 	}
 
-- 
2.23.0


* [RFC v1 15/19] mm, mm_init: rename mminit_verify_zonelist
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (13 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 14/19] mm, compaction: rename compaction_zonelist_suitable Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 16/19] mm, page_alloc: cleanup build_zonelists Pengfei Li
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

This is a cleanup patch. Rename mminit_verify_zonelist to
mminit_verify_nodelist.

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/internal.h   | 4 ++--
 mm/mm_init.c    | 2 +-
 mm/page_alloc.c | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 90008f9fe7d9..73ba9b6376cd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -443,7 +443,7 @@ do { \
 } while (0)
 
 extern void mminit_verify_pageflags_layout(void);
-extern void mminit_verify_zonelist(void);
+extern void mminit_verify_nodelist(void);
 #else
 
 static inline void mminit_dprintk(enum mminit_level level,
@@ -455,7 +455,7 @@ static inline void mminit_verify_pageflags_layout(void)
 {
 }
 
-static inline void mminit_verify_zonelist(void)
+static inline void mminit_verify_nodelist(void)
 {
 }
 #endif /* CONFIG_DEBUG_MEMORY_INIT */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 448e3228a911..ac91374b0e95 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -59,7 +59,7 @@ void __init mminit_print_nodelist(int nid, int nl_type)
 }
 
 /* The zonelists are simply reported, validation is manual. */
-void __init mminit_verify_zonelist(void)
+void __init mminit_verify_nodelist(void)
 {
 	int nid;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3987b8e97158..5b735eb88c0d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5803,7 +5803,7 @@ build_all_zonelists_init(void)
 	for_each_possible_cpu(cpu)
 		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
 
-	mminit_verify_zonelist();
+	mminit_verify_nodelist();
 	cpuset_init_current_mems_allowed();
 }
 
-- 
2.23.0


* [RFC v1 16/19] mm, page_alloc: cleanup build_zonelists
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (14 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 15/19] mm, mm_init: rename mminit_verify_zonelist Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 17/19] mm, memory_hotplug: cleanup online_pages() Pengfei Li
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

This is a cleanup patch.

Rename the functions used to build and initialize nodelists from
xxx_zonelist(s) to xxx_nodelist(s).

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 include/linux/mmzone.h |  2 +-
 init/main.c            |  2 +-
 mm/memory_hotplug.c    |  6 +++---
 mm/page_alloc.c        | 20 ++++++++++----------
 4 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 599b30620aa1..f1a492c13037 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -813,7 +813,7 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
 
 #include <linux/memory_hotplug.h>
 
-void build_all_zonelists(pg_data_t *pgdat);
+void build_all_nodelists(pg_data_t *pgdat);
 void wakeup_kswapd(pg_data_t *pgdat, gfp_t gfp_mask, int order,
 		   enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
diff --git a/init/main.c b/init/main.c
index 4d814de017ee..d561bdc537eb 100644
--- a/init/main.c
+++ b/init/main.c
@@ -602,7 +602,7 @@ asmlinkage __visible void __init start_kernel(void)
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
 	boot_cpu_hotplug_init();
 
-	build_all_zonelists(NULL);
+	build_all_nodelists(NULL);
 	page_alloc_init();
 
 	pr_notice("Kernel command line: %s\n", boot_command_line);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 46b2e056a43f..3c63529df112 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -821,7 +821,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 
 	node_states_set_node(nid, &arg);
 	if (need_zonelists_rebuild)
-		build_all_zonelists(NULL);
+		build_all_nodelists(NULL);
 	else
 		zone_pcp_update(zone);
 
@@ -904,7 +904,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	 * The node we allocated has no zone fallback lists. For avoiding
 	 * to access not-initialized zonelist, build here.
 	 */
-	build_all_zonelists(pgdat);
+	build_all_nodelists(pgdat);
 
 	/*
 	 * When memory is hot-added, all the memory is in offline state. So
@@ -1565,7 +1565,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
-		build_all_zonelists(NULL);
+		build_all_nodelists(NULL);
 	} else
 		zone_pcp_update(zone);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b735eb88c0d..146abe537300 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5631,7 +5631,7 @@ static void build_thisnode_nodelists(pg_data_t *pgdat)
  * may still exist in local DMA zone.
  */
 
-static void build_zonelists(pg_data_t *pgdat)
+static void build_nodelists(pg_data_t *pgdat)
 {
 	static int node_order[MAX_NUMNODES];
 	int node, load, nr_nodes = 0;
@@ -5682,7 +5682,7 @@ static void setup_min_unmapped_ratio(void);
 static void setup_min_slab_ratio(void);
 #else	/* CONFIG_NUMA */
 
-static void build_zonelists(pg_data_t *pgdat)
+static void build_nodelists(pg_data_t *pgdat)
 {
 	int node, local_node;
 	struct nlist_entry *entry;
@@ -5737,7 +5737,7 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
-static void __build_all_zonelists(void *data)
+static void __build_all_nodelists(void *data)
 {
 	int nid;
 	int __maybe_unused cpu;
@@ -5755,12 +5755,12 @@ static void __build_all_zonelists(void *data)
 	 * building zonelists is fine - no need to touch other nodes.
 	 */
 	if (self && !node_online(self->node_id)) {
-		build_zonelists(self);
+		build_nodelists(self);
 	} else {
 		for_each_online_node(nid) {
 			pg_data_t *pgdat = NODE_DATA(nid);
 
-			build_zonelists(pgdat);
+			build_nodelists(pgdat);
 		}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
@@ -5781,11 +5781,11 @@ static void __build_all_zonelists(void *data)
 }
 
 static noinline void __init
-build_all_zonelists_init(void)
+build_all_nodelists_init(void)
 {
 	int cpu;
 
-	__build_all_zonelists(NULL);
+	__build_all_nodelists(NULL);
 
 	/*
 	 * Initialize the boot_pagesets that are going to be used
@@ -5813,12 +5813,12 @@ build_all_zonelists_init(void)
  * __ref due to call of __init annotated helper build_all_zonelists_init
  * [protected by SYSTEM_BOOTING].
  */
-void __ref build_all_zonelists(pg_data_t *pgdat)
+void __ref build_all_nodelists(pg_data_t *pgdat)
 {
 	if (system_state == SYSTEM_BOOTING) {
-		build_all_zonelists_init();
+		build_all_nodelists_init();
 	} else {
-		__build_all_zonelists(pgdat);
+		__build_all_nodelists(pgdat);
 		/* cpuset refresh routine should be here */
 	}
 	vm_total_pages = nr_free_pagecache_pages();
-- 
2.23.0


* [RFC v1 17/19] mm, memory_hotplug: cleanup online_pages()
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (15 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 16/19] mm, page_alloc: cleanup build_zonelists Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 18/19] kernel, sysctl: cleanup numa_zonelist_order Pengfei Li
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

This is a cleanup patch.

In online_pages(), rename the local variable need_zonelists_rebuild
to need_nodelists_rebuild.

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 mm/memory_hotplug.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3c63529df112..3ff55da7b225 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -760,10 +760,10 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
 {
-	unsigned long flags;
+	bool need_nodelists_rebuild = false;
 	unsigned long onlined_pages = 0;
+	unsigned long flags;
 	struct zone *zone;
-	int need_zonelists_rebuild = 0;
 	int nid;
 	int ret;
 	struct memory_notify arg;
@@ -798,7 +798,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	 * So, zonelist must be updated after online.
 	 */
 	if (!populated_zone(zone)) {
-		need_zonelists_rebuild = 1;
+		need_nodelists_rebuild = true;
 		setup_zone_pageset(zone);
 	}
 
@@ -806,7 +806,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 		online_pages_range);
 	if (ret) {
 		/* not a single memory resource was applicable */
-		if (need_zonelists_rebuild)
+		if (need_nodelists_rebuild)
 			zone_pcp_reset(zone);
 		goto failed_addition;
 	}
@@ -820,7 +820,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	shuffle_zone(zone);
 
 	node_states_set_node(nid, &arg);
-	if (need_zonelists_rebuild)
+	if (need_nodelists_rebuild)
 		build_all_nodelists(NULL);
 	else
 		zone_pcp_update(zone);
-- 
2.23.0


* [RFC v1 18/19] kernel, sysctl: cleanup numa_zonelist_order
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (16 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 17/19] mm, memory_hotplug: cleanup online_pages() Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 15:18 ` [RFC v1 19/19] mm, mmzone: cleanup zonelist in comments Pengfei Li
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

This is a cleanup patch.

Rename numa_zonelist_order to numa_nodelist_order, and rename the
related helpers (__parse_numa_zonelist_order, setup_numa_zonelist_order,
numa_zonelist_order_handler and NUMA_ZONELIST_ORDER_LEN) accordingly.
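
Note that this renames the user-visible interface as well: the sysctl
becomes /proc/sys/vm/numa_nodelist_order (vm.numa_nodelist_order) and
the boot parameter becomes numa_nodelist_order=.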

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 include/linux/mmzone.h |  6 +++---
 kernel/sysctl.c        |  8 ++++----
 mm/page_alloc.c        | 16 ++++++++--------
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f1a492c13037..0423a84dfd7d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -959,10 +959,10 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 
-extern int numa_zonelist_order_handler(struct ctl_table *, int,
+extern int numa_nodelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
-extern char numa_zonelist_order[];
-#define NUMA_ZONELIST_ORDER_LEN	16
+extern char numa_nodelist_order[];
+#define NUMA_NODELIST_ORDER_LEN	16
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 50373984a5e2..040c0c561399 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1648,11 +1648,11 @@ static struct ctl_table vm_table[] = {
 #endif
 #ifdef CONFIG_NUMA
 	{
-		.procname	= "numa_zonelist_order",
-		.data		= &numa_zonelist_order,
-		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
+		.procname	= "numa_nodelist_order",
+		.data		= &numa_nodelist_order,
+		.maxlen		= NUMA_NODELIST_ORDER_LEN,
 		.mode		= 0644,
-		.proc_handler	= numa_zonelist_order_handler,
+		.proc_handler	= numa_nodelist_order_handler,
 	},
 #endif
 #if (defined(CONFIG_X86_32) && !defined(CONFIG_UML))|| \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 146abe537300..bc24e614c296 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5479,7 +5479,7 @@ static inline void set_last_nlist_entry(struct nlist_entry *entry)
 
 #ifdef CONFIG_NUMA
 
-static int __parse_numa_zonelist_order(char *s)
+static int __parse_numa_nodelist_order(char *s)
 {
 	/*
 	 * We used to support different zonlists modes but they turned
@@ -5488,27 +5488,27 @@ static int __parse_numa_zonelist_order(char *s)
 	 * not fail it silently
 	 */
 	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
-		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
+		pr_warn("Ignoring unsupported numa_nodelist_order value:  %s\n", s);
 		return -EINVAL;
 	}
 	return 0;
 }
 
-static __init int setup_numa_zonelist_order(char *s)
+static __init int setup_numa_nodelist_order(char *s)
 {
 	if (!s)
 		return 0;
 
-	return __parse_numa_zonelist_order(s);
+	return __parse_numa_nodelist_order(s);
 }
-early_param("numa_zonelist_order", setup_numa_zonelist_order);
+early_param("numa_nodelist_order", setup_numa_nodelist_order);
 
-char numa_zonelist_order[] = "Node";
+char numa_nodelist_order[] = "Node";
 
 /*
  * sysctl handler for numa_zonelist_order
  */
-int numa_zonelist_order_handler(struct ctl_table *table, int write,
+int numa_nodelist_order_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos)
 {
@@ -5521,7 +5521,7 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
 	if (IS_ERR(str))
 		return PTR_ERR(str);
 
-	ret = __parse_numa_zonelist_order(str);
+	ret = __parse_numa_nodelist_order(str);
 	kfree(str);
 	return ret;
 }
-- 
2.23.0


* [RFC v1 19/19] mm, mmzone: cleanup zonelist in comments
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (17 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 18/19] kernel, sysctl: cleanup numa_zonelist_order Pengfei Li
@ 2019-11-21 15:18 ` Pengfei Li
  2019-11-21 18:04 ` [RFC v1 00/19] Modify zonelist to nodelist v1 Michal Hocko
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 39+ messages in thread
From: Pengfei Li @ 2019-11-21 15:18 UTC (permalink / raw)
  To: akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, Pengfei Li

This is a cleanup patch.

Clean up all comments that still refer to zonelists.

Signed-off-by: Pengfei Li <fly@kernel.page>
---
 arch/hexagon/mm/init.c           |  2 +-
 arch/ia64/include/asm/topology.h |  2 +-
 arch/x86/mm/numa.c               |  2 +-
 include/linux/gfp.h              |  8 +++---
 include/linux/mmzone.h           |  4 +--
 kernel/cgroup/cpuset.c           |  4 +--
 mm/internal.h                    |  2 +-
 mm/memory_hotplug.c              |  8 +++---
 mm/mempolicy.c                   |  2 +-
 mm/mm_init.c                     |  2 +-
 mm/page_alloc.c                  | 45 ++++++++++++++++----------------
 mm/vmscan.c                      |  2 +-
 12 files changed, 41 insertions(+), 42 deletions(-)

diff --git a/arch/hexagon/mm/init.c b/arch/hexagon/mm/init.c
index c961773a6fff..65eccf11cf55 100644
--- a/arch/hexagon/mm/init.c
+++ b/arch/hexagon/mm/init.c
@@ -103,7 +103,7 @@ void __init paging_init(void)
 
 	zones_sizes[ZONE_NORMAL] = max_low_pfn;
 
-	free_area_init(zones_sizes);  /*  sets up the zonelists and mem_map  */
+	free_area_init(zones_sizes);  /*  sets up the nodelists and mem_map  */
 
 	/*
 	 * Start of high memory area.  Will probably need something more
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 43567240b0d6..cd3f4b121c89 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -13,7 +13,7 @@
 
 #ifdef CONFIG_NUMA
 
-/* Nodes w/o CPUs are preferred for memory allocations, see build_zonelists */
+/* Nodes w/o CPUs are preferred for memory allocations, see build_nodelists */
 #define PENALTY_FOR_NODE_WITH_CPUS 255
 
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4123100e0eaf..4ada68abdcc9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -728,7 +728,7 @@ static void __init init_memory_less_node(int nid)
 	free_area_init_node(nid, zones_size, 0, zholes_size);
 
 	/*
-	 * All zonelists will be built later in start_kernel() after per cpu
+	 * All nodelists will be built later in start_kernel() after per cpu
 	 * areas are initialized.
 	 */
 }
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6caab5a30f39..882c3d844ea1 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -466,10 +466,10 @@ static inline int gfp_nodelist(gfp_t flags)
 }
 
 /*
- * We get the zone list from the current node and the gfp_mask.
- * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
- * There are two zonelists per node, one for all zones with memory and
- * one containing just zones from the node the zonelist belongs to.
+ * We get the node list from the current node and the gfp_mask.
+ * This node list contains a maximum of MAXNODES nodes.
+ * There are two nodelists per node, one for all zones with memory and
+ * one containing just zones from the node the nodelist belongs to.
  *
  * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
  * optimized to &contig_page_data at compile-time.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0423a84dfd7d..4b21e884b357 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -648,7 +648,7 @@ enum {
 	NODELIST_FALLBACK,	/* nodelist with fallback */
 #ifdef CONFIG_NUMA
 	/*
-	 * The NUMA zonelists are doubled because we need nodelists that
+	 * The NUMA nodelists are doubled because we need nodelists that
 	 * restrict the allocations to a single node for __GFP_THISNODE.
 	 */
 	NODELIST_NOFALLBACK,	/* nodelist without fallback (__GFP_THISNODE) */
@@ -657,7 +657,7 @@ enum {
 };
 
 /*
- * This struct contains information about a node in a zonelist. It is stored
+ * This struct contains information about a node in a nodelist. It is stored
  * here to avoid dereferences into large structures and lookups of tables
  */
 struct nlist_entry {
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 58f5073acff7..759d122b17e1 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3393,7 +3393,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
  * __alloc_pages() routine only calls here with __GFP_HARDWALL bit
  * _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
  * current tasks mems_allowed came up empty on the first pass over
- * the zonelist.  So only GFP_KERNEL allocations, if all nodes in the
+ * the nodelist.  So only GFP_KERNEL allocations, if all nodes in the
  * cpuset are short of memory, might require taking the callback_lock.
  *
  * The first call here from mm/page_alloc:get_page_from_freelist()
@@ -3467,7 +3467,7 @@ bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
  * should not be possible for the following code to return an
  * offline node.  But if it did, that would be ok, as this routine
  * is not returning the node where the allocation must be, only
- * the node where the search should start.  The zonelist passed to
+ * the node where the search should start.  The nodelist passed to
  * __alloc_pages() will include all nodes.  If the slab allocator
  * is passed an offline node, it will fall back to the local node.
  * See kmem_cache_alloc_node().
diff --git a/mm/internal.h b/mm/internal.h
index 73ba9b6376cd..68954eac0bd9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,7 +105,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
  * nodemask, migratetype and high_zoneidx are initialized only once in
  * __alloc_pages_nodemask() and then never change.
  *
- * zonelist, preferred_zone and classzone_idx are set first in
+ * nodelist, preferred_zone and classzone_idx are set first in
  * __alloc_pages_nodemask() for the fast path, and might be later changed
  * in __alloc_pages_slowpath(). All other functions pass the whole strucure
  * by a const pointer.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3ff55da7b225..9dbd3377932f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -793,9 +793,9 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 		goto failed_addition;
 
 	/*
-	 * If this zone is not populated, then it is not in zonelist.
+	 * If this zone is not populated, then it is not in nodelist.
 	 * This means the page allocator ignores this zone.
-	 * So, zonelist must be updated after online.
+	 * So, nodelist must be updated after online.
 	 */
 	if (!populated_zone(zone)) {
 		need_nodelists_rebuild = true;
@@ -901,8 +901,8 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	free_area_init_core_hotplug(nid);
 
 	/*
-	 * The node we allocated has no zone fallback lists. For avoiding
-	 * to access not-initialized zonelist, build here.
+	 * The node we allocated has no node fallback lists. For avoiding
+	 * to access not-initialized nodelist, build here.
 	 */
 	build_all_nodelists(pgdat);
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8fd962762e46..fc90cea07cfb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1950,7 +1950,7 @@ static inline unsigned interleave_nid(struct mempolicy *pol,
  * Returns a nid suitable for a huge page allocation and a pointer
  * to the struct mempolicy for conditional unref after allocation.
  * If the effective policy is 'BIND, returns a pointer to the mempolicy's
- * @nodemask for filtering the zonelist.
+ * @nodemask for filtering the nodelist.
  *
  * Must be protected by read_mems_allowed_begin()
  */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index ac91374b0e95..fff486f9a330 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -58,7 +58,7 @@ void __init mminit_print_nodelist(int nid, int nl_type)
 	}
 }
 
-/* The zonelists are simply reported, validation is manual. */
+/* The nodelists are simply reported, validation is manual. */
 void __init mminit_verify_nodelist(void)
 {
 	int nid;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bc24e614c296..6998e051a94b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2647,7 +2647,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 	/*
 	 * Do not steal pages from freelists belonging to other pageblocks
 	 * i.e. orders < pageblock_order. If there are no local zones free,
-	 * the zonelists will be reiterated without ALLOC_NOFRAGMENT.
+	 * the nodelists will be reiterated without ALLOC_NOFRAGMENT.
 	 */
 	if (alloc_flags & ALLOC_NOFRAGMENT)
 		min_order = pageblock_order;
@@ -3572,7 +3572,7 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 }
 
 /*
- * get_page_from_freelist goes through the zonelist trying to allocate
+ * get_page_from_freelist goes through the nodelist trying to allocate
  * a page.
  */
 static struct page *
@@ -3586,7 +3586,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 retry:
 	/*
-	 * Scan zonelist, looking for a zone with enough free.
+	 * Scan nodelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
@@ -3811,7 +3811,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
+	 * Go through the nodelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure. But make sure that this reclaim
 	 * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
@@ -4441,7 +4441,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
 	/*
-	 * We need to recalculate the starting point for the zonelist iterator
+	 * We need to recalculate the starting point for the nodelist iterator
 	 * because we might have used different nodemask in the fast path, or
 	 * there was a cpuset modification and we are retrying - otherwise we
 	 * could end up iterating over non-eligible zones endlessly.
@@ -4526,7 +4526,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		alloc_flags = reserve_flags;
 
 	/*
-	 * Reset the nodemask and zonelist iterators if memory policies can be
+	 * Reset the nodemask and nodelist iterators if memory policies can be
 	 * ignored. These allocations are high priority and system rather than
 	 * user oriented.
 	 */
@@ -4535,7 +4535,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		reset_ac_preferred_zone(ac);
 	}
 
-	/* Attempt with potentially adjusted zonelist and alloc_flags */
+	/* Attempt with potentially adjusted nodelist and alloc_flags */
 	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
@@ -4701,7 +4701,7 @@ static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
 
 	/*
 	 * The preferred zone is used for statistics but crucially it is
-	 * also used as the starting point for the zonelist iterator. It
+	 * also used as the starting point for the nodelist iterator. It
 	 * may get reset for allocations that ignore memory policies.
 	 */
 	reset_ac_preferred_zone(ac);
@@ -5429,9 +5429,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 }
 
 /*
- * Builds allocation fallback zone lists.
- *
- * Add all populated zones of a node to the zonelist.
+ * Builds allocation fallback node lists.
+ * Add all populated zones of a node to the nodelist.
  */
 static struct nlist_entry *
 set_nlist_entry(struct nlist_entry *entry, int nid)
@@ -5506,7 +5505,7 @@ early_param("numa_nodelist_order", setup_numa_nodelist_order);
 char numa_nodelist_order[] = "Node";
 
 /*
- * sysctl handler for numa_zonelist_order
+ * sysctl handler for numa_nodelist_order
  */
 int numa_nodelist_order_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
@@ -5592,7 +5591,7 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 }
 
 /*
- * Build zonelists ordered by node and zones within node.
+ * Build nodelists ordered by node and zones within node.
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
@@ -5611,7 +5610,7 @@ static void build_nodelists_in_node_order(pg_data_t *pgdat, int *node_order,
 }
 
 /*
- * Build gfp_thisnode zonelists
+ * Build gfp_thisnode nodelists
  */
 static void build_thisnode_nodelists(pg_data_t *pgdat)
 {
@@ -5625,7 +5624,7 @@ static void build_thisnode_nodelists(pg_data_t *pgdat)
 }
 
 /*
- * Build zonelists ordered by zone and nodes within zones.
+ * Build nodelists ordered by zone and nodes within zones.
  * This results in conserving DMA zone[s] until all Normal memory is
  * exhausted, but results in overflowing to remote node while memory
  * may still exist in local DMA zone.
@@ -5667,9 +5666,9 @@ static void build_nodelists(pg_data_t *pgdat)
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
 /*
  * Return node id of node used for "local" allocations.
- * I.e., first node id of first zone in arg node's generic zonelist.
+ * I.e., first node id of first zone in arg node's generic nodelist.
  * Used for initializing percpu 'numa_mem', which is used primarily
- * for kernel allocations, so use GFP_KERNEL flags to locate zonelist.
+ * for kernel allocations, so use GFP_KERNEL flags to locate nodelist.
  */
 int local_memory_node(int node)
 {
@@ -5693,7 +5692,7 @@ static void build_nodelists(pg_data_t *pgdat)
 	entry = set_nlist_entry(entry, local_node);
 
 	/*
-	 * Now we build the zonelist so that it contains the zones
+	 * Now we build the nodelist so that it contains the zones
 	 * of all the other nodes.
 	 * We don't want to pressure a particular node, so when
 	 * building the zones for node N, we make sure that the
@@ -5752,7 +5751,7 @@ static void __build_all_nodelists(void *data)
 
 	/*
 	 * This node is hotadded and no memory is yet present.   So just
-	 * building zonelists is fine - no need to touch other nodes.
+	 * building nodelists is fine - no need to touch other nodes.
 	 */
 	if (self && !node_online(self->node_id)) {
 		build_nodelists(self);
@@ -5766,7 +5765,7 @@ static void __build_all_nodelists(void *data)
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
 		/*
 		 * We now know the "local memory node" for each node--
-		 * i.e., the node of the first zone in the generic zonelist.
+		 * i.e., the node of the first zone in the generic nodelist.
 		 * Set up numa_mem percpu variable for on-line cpus.  During
 		 * boot, only the boot cpu should be on-line;  we'll init the
 		 * secondary cpus' numa_mem as they come on-line.  During
@@ -5810,7 +5809,7 @@ build_all_nodelists_init(void)
 /*
  * unless system_state == SYSTEM_BOOTING.
  *
- * __ref due to call of __init annotated helper build_all_zonelists_init
+ * __ref due to call of __init annotated helper build_all_nodelists_init
  * [protected by SYSTEM_BOOTING].
  */
 void __ref build_all_nodelists(pg_data_t *pgdat)
@@ -5834,7 +5833,7 @@ void __ref build_all_nodelists(pg_data_t *pgdat)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	pr_info("Built %u zonelists, mobility grouping %s.  Total pages: %ld\n",
+	pr_info("Built %u nodelists, mobility grouping %s.  Total pages: %ld\n",
 		nr_online_nodes,
 		page_group_by_mobility_disabled ? "off" : "on",
 		vm_total_pages);
@@ -8554,7 +8553,7 @@ static bool zone_spans_last_pfn(const struct zone *zone,
  * @nodemask:	Mask for other possible nodes
  *
  * This routine is a wrapper around alloc_contig_range(). It scans over zones
- * on an applicable zonelist to find a contiguous pfn range which can then be
+ * on an applicable nodelist to find a contiguous pfn range which can then be
  * tried for allocation with alloc_contig_range(). This routine is intended
  * for allocation requests which can not be fulfilled with the buddy allocator.
  *
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b0e51525c3a..8a0c786f7c25 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3360,7 +3360,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_swap = may_swap,
 	};
 	/*
-	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put
+	 * Traverse the NODELIST_FALLBACK nodelist of the current node to put
 	 * equal pressure on all the nodes. This is based on the assumption that
 	 * the reclaim does not bail out early.
 	 */
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (18 preceding siblings ...)
  2019-11-21 15:18 ` [RFC v1 19/19] mm, mmzone: cleanup zonelist in comments Pengfei Li
@ 2019-11-21 18:04 ` Michal Hocko
  2019-11-22 15:05   ` Pengfei Li
  2019-11-22 10:03 ` David Hildenbrand
       [not found] ` <2019112215245905276118@gmail.com>
  21 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-11-21 18:04 UTC (permalink / raw)
  To: Pengfei Li
  Cc: akpm, mgorman, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel, linux-mm

On Thu 21-11-19 23:17:52, Pengfei Li wrote:
[...]
> Since I don't currently have multiple node NUMA systems, I would be
> grateful if anyone would like to test this series of patches.

I didn't really get to think about the actual patchset. From a very
quick glance I am wondering whether we need to optimize, as there is
usually only a small number of NUMA nodes. But I am quite busy so I
cannot really make any claims.

Anyway, you can test this even without NUMA HW. Have a look at numa=fake
option (numa_emu_cmdline). Or you can use kvm/qemu which provides easy
ways to setup a NUMA capable virtual machine.
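
For example (a rough sketch; the exact qemu options may differ between
versions, so double check against your qemu documentation):

	# x86-64 kernel command line: emulate 4 NUMA nodes on non-NUMA hardware
	numa=fake=4

	# or a kvm/qemu guest with two NUMA nodes (2 CPUs and 2G each)
	qemu-system-x86_64 -enable-kvm -smp 4 -m 4G \
		-numa node,cpus=0-1,mem=2G \
		-numa node,cpus=2-3,mem=2G
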
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
                   ` (19 preceding siblings ...)
  2019-11-21 18:04 ` [RFC v1 00/19] Modify zonelist to nodelist v1 Michal Hocko
@ 2019-11-22 10:03 ` David Hildenbrand
  2019-11-22 15:49   ` Pengfei Li
       [not found] ` <2019112215245905276118@gmail.com>
  21 siblings, 1 reply; 39+ messages in thread
From: David Hildenbrand @ 2019-11-22 10:03 UTC (permalink / raw)
  To: Pengfei Li, akpm
  Cc: mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm

On 21.11.19 16:17, Pengfei Li wrote:
> Motivation
> ----------
> Currently if we want to iterate through all the nodes we have to
> traverse all the zones from the zonelist.
> 
> So in order to reduce the number of loops required to traverse node,
> this series of patches modified the zonelist to nodelist.
> 
> Two new macros have been introduced:
> 1) for_each_node_nlist
> 2) for_each_node_nlist_nodemask
> 
> 
> Benefit
> -------
> 1. For a NUMA system with N nodes, each node has M zones, the number
>     of loops is reduced from N*M times to N times when traversing node.
> 
> 2. The size of pg_data_t is much reduced.
> 
> 
> Test Result
> -----------
> Currently I have only performed a simple page allocation benchmark
> test on my laptop, and the results show that the performance of a
> system with only one node is almost unaffected.
> 

So you are seeing no performance changes. I am wondering why we need 
this, then - because your motivation sounds like a performance 
improvement? (not completely against this, just trying to understand the 
value of this :) )


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
       [not found] ` <2019112215245905276118@gmail.com>
@ 2019-11-22 10:14   ` David Hildenbrand
  2019-11-22 15:28   ` Pengfei Li
  1 sibling, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2019-11-22 10:14 UTC (permalink / raw)
  To: lixinhai.lxh, Pengfei Li, akpm
  Cc: mgorman, Michal Hocko, Vlastimil Babka, cl, iamjoonsoo.kim, guro,
	linux-kernel, linux-mm

On 22.11.19 08:25, lixinhai.lxh@gmail.com wrote:
> On 2019-11-21 at 23:17 Pengfei Li wrote:
>> Motivation
>> ----------
>> Currently if we want to iterate through all the nodes we have to
>> traverse all the zones from the zonelist.
>>
>> So in order to reduce the number of loops required to traverse node,
>> this series of patches modified the zonelist to nodelist.
>>
>> Two new macros have been introduced:
>> 1) for_each_node_nlist
>> 2) for_each_node_nlist_nodemask
>>
>>
>> Benefit
>> -------
>> 1. For a NUMA system with N nodes, each node has M zones, the number
>>     of loops is reduced from N*M times to N times when traversing node.
>>
> 
> It looks to me that we don't really have systems which have N nodes and
> each node with M zones in its address range.
> We may have systems which have several nodes, but only the first node has
> all zone types; other nodes only have the NORMAL zone. (Evenly distributing
> the !NORMAL zones across all nodes is not reasonable, as those zones have
> limited size.)
> So iterating over zones to reach nodes should be at N level, not M*N level.

I guess NORMAL/MOVABLE/DEVICE would be common for most nodes, while I do
agree that usually we will only have 1 or 2 zones per node (when we have
many nodes). So it would be something like c*N, where c is most probably
around 2 on average.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-21 18:04 ` [RFC v1 00/19] Modify zonelist to nodelist v1 Michal Hocko
@ 2019-11-22 15:05   ` Pengfei Li
  2019-11-25  8:40     ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Pengfei Li @ 2019-11-22 15:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, mgorman, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel, linux-mm

On Thu, 21 Nov 2019 19:04:01 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Thu 21-11-19 23:17:52, Pengfei Li wrote:
> [...]
> > Since I don't currently have multiple node NUMA systems, I would be
> > grateful if anyone would like to test this series of patches.
> 
> I didn't really get to think about the actual patchset. From a very
> quick glance I am wondering whether we need to optimize, as there is
> usually only a small number of NUMA nodes. But I am quite busy so I
> cannot really make any claims.

Thanks for your comments.

I think it's time to modify the zonelist to nodelist because the
zonelist is always in node order and the page reclamation is based on
node.

I will do more performance testing to show that multi-node systems will
benefit from this series of patches.

> Anyway, you can test this even without NUMA HW. Have a look at
> numa=fake option (numa_emu_cmdline). Or you can use kvm/qemu which
> provides easy ways to setup a NUMA capable virtual machine.

Thanks a lot. I will use the numa=fake option to do more performance
testing.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
       [not found] ` <2019112215245905276118@gmail.com>
  2019-11-22 10:14   ` David Hildenbrand
@ 2019-11-22 15:28   ` Pengfei Li
  2019-11-22 15:53     ` Qian Cai
  1 sibling, 1 reply; 39+ messages in thread
From: Pengfei Li @ 2019-11-22 15:28 UTC (permalink / raw)
  To: lixinhai.lxh
  Cc: akpm, mgorman, Michal Hocko, Vlastimil Babka, cl, iamjoonsoo.kim,
	guro, linux-kernel, linux-mm, fly

On Fri, 22 Nov 2019 15:25:00 +0800
"lixinhai.lxh@gmail.com" <lixinhai.lxh@gmail.com> wrote:

> On 2019-11-21 at 23:17 Pengfei Li wrote:
> >Motivation
> >----------
> >Currently if we want to iterate through all the nodes we have to
> >traverse all the zones from the zonelist.
> >
> >So in order to reduce the number of loops required to traverse node,
> >this series of patches modified the zonelist to nodelist.
> >
> >Two new macros have been introduced:
> >1) for_each_node_nlist
> >2) for_each_node_nlist_nodemask
> >
> >
> >Benefit
> >-------
> >1. For a NUMA system with N nodes, each node has M zones, the number
> >   of loops is reduced from N*M times to N times when traversing
> >node.
> > 
> 
> It looks to me that we don't really have systems which have N nodes and
> each node with M zones in its address range.
> We may have systems which have several nodes, but only the first node
> has all zone types; other nodes only have the NORMAL zone. (Evenly
> distributing the !NORMAL zones across all nodes is not reasonable, as
> those zones have limited size.)
> So iterating over zones to reach nodes should be at N level, not M*N level.
> 

Thanks for your comments.

In the case you describe, the number of loops required to traverse all
nodes is similar to traversing all zones.

I have two main reasons why this series of patches is beneficial.

1. When a node has more than one zone, it will take fewer cycles to
traverse all nodes. (for example, ZONE_MOVABLE?)

2. Using the zonelist to traverse all nodes is inefficient: pgdat must be
obtained indirectly via zone->zone_pgdat, and an additional check must
be made.

E.g
1) Using zonelist to traverse all nodes

	last_pgdat = NULL;	
	for_each_zone_zonelist(zone, xxx) {
		pgdat = zone->zone_pgdat;
		if (pgdat == last_pgdat)
			continue;

		last_pgdat = pgdat;
		do_something(pgdat);
	}

2) Using nodelist to traverse all nodes

	for_each_node_nodelist(node, xxx) {
		do_something(NODE_INFO(node));
	}

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 10:03 ` David Hildenbrand
@ 2019-11-22 15:49   ` Pengfei Li
  2019-11-22 15:53     ` Christopher Lameter
  0 siblings, 1 reply; 39+ messages in thread
From: Pengfei Li @ 2019-11-22 15:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, mgorman, mhocko, vbabka, cl, iamjoonsoo.kim, guro,
	linux-kernel, linux-mm, fly

On Fri, 22 Nov 2019 11:03:30 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 21.11.19 16:17, Pengfei Li wrote:
> > Motivation
> > ----------
> > Currently if we want to iterate through all the nodes we have to
> > traverse all the zones from the zonelist.
> > 
> > So in order to reduce the number of loops required to traverse node,
> > this series of patches modified the zonelist to nodelist.
> > 
> > Two new macros have been introduced:
> > 1) for_each_node_nlist
> > 2) for_each_node_nlist_nodemask
> > 
> > 
> > Benefit
> > -------
> > 1. For a NUMA system with N nodes, each node has M zones, the number
> >     of loops is reduced from N*M times to N times when traversing
> > node.
> > 
> > 2. The size of pg_data_t is much reduced.
> > 
> > 
> > Test Result
> > -----------
> > Currently I have only performed a simple page allocation benchmark
> > test on my laptop, and the results show that the performance of a
> > system with only one node is almost unaffected.
> > 
> 
> So you are seeing no performance changes. I am wondering why we
> need this, then - because your motivation sounds like a performance 
> improvement? (not completely against this, just trying to understand
> the value of this :) )

Thanks for your comments.

I am sorry that I did not make it clear. I want to express that this series
of patches will benefit NUMA systems with multiple nodes.

The main benefit is that it will be more efficient when traversing all
nodes (for example when performing page reclamation).


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 15:28   ` Pengfei Li
@ 2019-11-22 15:53     ` Qian Cai
  2019-11-25  8:39       ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Qian Cai @ 2019-11-22 15:53 UTC (permalink / raw)
  To: Pengfei Li, lixinhai.lxh
  Cc: akpm, mgorman, Michal Hocko, Vlastimil Babka, cl, iamjoonsoo.kim,
	guro, linux-kernel, linux-mm

On Fri, 2019-11-22 at 23:28 +0800, Pengfei Li wrote:
> On Fri, 22 Nov 2019 15:25:00 +0800
> "lixinhai.lxh@gmail.com" <lixinhai.lxh@gmail.com> wrote:
> 
> > On 2019-11-21 at 23:17 Pengfei Li wrote:
> > > Motivation
> > > ----------
> > > Currently if we want to iterate through all the nodes we have to
> > > traverse all the zones from the zonelist.
> > > 
> > > So in order to reduce the number of loops required to traverse node,
> > > this series of patches modified the zonelist to nodelist.
> > > 
> > > Two new macros have been introduced:
> > > 1) for_each_node_nlist
> > > 2) for_each_node_nlist_nodemask
> > > 
> > > 
> > > Benefit
> > > -------
> > > 1. For a NUMA system with N nodes, each node has M zones, the number
> > >    of loops is reduced from N*M times to N times when traversing
> > > node.
> > > 
> > 
> > It looks to me that we don't really have systems which have N nodes and
> > each node with M zones in its address range.
> > We may have systems which have several nodes, but only the first node
> > has all zone types; other nodes only have the NORMAL zone. (Evenly
> > distributing the !NORMAL zones across all nodes is not reasonable, as
> > those zones have limited size.)
> > So iterating over zones to reach nodes should be at N level, not M*N level.
> > 
> 
> Thanks for your comments.
> 
> In the case you describe, the number of loops required to traverse all
> nodes is similar to traversing all zones.
> 
> I have two main reasons why this series of patches is beneficial.
> 
> 1. When a node has more than one zone, it will take fewer cycles to
> traverse all nodes. (for example, ZONE_MOVABLE?)

ZONE_MOVABLE has been broken for ages (non-movable allocations are there all
the time last time I tried), which indicates that very few people care about
it, so it is rather weak to use that as a justification for the churn it
might cause.

> 
> 2. Using the zonelist to traverse all nodes is inefficient: pgdat must be
> obtained indirectly via zone->zone_pgdat, and an additional check must
> be made.
> 
> E.g
> 1) Using zonelist to traverse all nodes
> 
> 	last_pgdat = NULL;	
> 	for_each_zone_zonelist(zone, xxx) {
> 		pgdat = zone->zone_pgdat;
> 		if (pgdat == last_pgdat)
> 			continue;
> 
> 		last_pgdat = pgdat;
> 		do_something(pgdat);
> 	}
> 
> 2) Using nodelist to traverse all nodes
> 
> 	for_each_node_nodelist(node, xxx) {
> 		do_something(NODE_INFO(node));
> 	}
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 15:49   ` Pengfei Li
@ 2019-11-22 15:53     ` Christopher Lameter
  2019-11-22 16:06       ` David Hildenbrand
  2019-11-22 17:36       ` Pengfei Li
  0 siblings, 2 replies; 39+ messages in thread
From: Christopher Lameter @ 2019-11-22 15:53 UTC (permalink / raw)
  To: Pengfei Li
  Cc: David Hildenbrand, akpm, mgorman, mhocko, vbabka, iamjoonsoo.kim,
	guro, linux-kernel, linux-mm

On Fri, 22 Nov 2019, Pengfei Li wrote:

> I am sorry that I did not make it clear. I want to express that this series
> of patches will benefit NUMA systems with multiple nodes.

Ok but that benefit needs to be quantified somehow.

> The main benefit is that it will be more efficient when traversing all
> nodes (for example when performing page reclamation).

And you lose the prioritization of allocations through these different
zones. We create zonelists with a certain sequence of the zones in
order to prefer allocations from certain zones. This is in particular
important for the smaller DMA zones which may be easily exhausted.
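
For reference, the zone-type filtering in current mainline looks roughly
like this (a sketch of the existing for_each_zone_zonelist_nodemask()
usage, not of the patched code; nid, gfp_mask and nodemask stand in for
the caller's context):

	struct zoneref *z;
	struct zone *zone;
	struct zonelist *zonelist = node_zonelist(nid, gfp_mask);
	enum zone_type highidx = gfp_zone(gfp_mask);

	/*
	 * Zones above gfp_zone(gfp_mask) (e.g. everything above DMA for
	 * a GFP_DMA request) are skipped; the rest are visited in the
	 * fallback order the zonelist was built with.
	 */
	for_each_zone_zonelist_nodemask(zone, z, zonelist, highidx, nodemask) {
		/* candidate zones, most preferred first */
	}

Whatever replaces the zonelist needs to preserve exactly this filtering
and ordering.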


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 15:53     ` Christopher Lameter
@ 2019-11-22 16:06       ` David Hildenbrand
  2019-11-22 17:36       ` Pengfei Li
  1 sibling, 0 replies; 39+ messages in thread
From: David Hildenbrand @ 2019-11-22 16:06 UTC (permalink / raw)
  To: Christopher Lameter, Pengfei Li
  Cc: akpm, mgorman, mhocko, vbabka, iamjoonsoo.kim, guro,
	linux-kernel, linux-mm

On 22.11.19 16:53, Christopher Lameter wrote:
> On Fri, 22 Nov 2019, Pengfei Li wrote:
> 
>> I am sorry that I did not make it clear. I want to express that this series
>> of patches will benefit NUMA systems with multiple nodes.
> 
> Ok but that benefit needs to be quantified somehow.
> 
>> The main benefit is that it will be more efficient when traversing all
>> nodes (for example when performing page reclamation).
> 
> And you lose the prioritization of allocations through these different
> zones. We create zonelists with a certain sequence of the zones in
> order to prefer allocations from certain zones. This is in particular
> important for the smaller DMA zones which may be easily exhausted.
> 

That's an important point also when MOVABLE is in place. You really
don't want to go to random other zones before considering all MOVABLE
zones. Same with DMA and NORMAL.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 15:53     ` Christopher Lameter
  2019-11-22 16:06       ` David Hildenbrand
@ 2019-11-22 17:36       ` Pengfei Li
  2019-11-22 18:24         ` Christopher Lameter
  1 sibling, 1 reply; 39+ messages in thread
From: Pengfei Li @ 2019-11-22 17:36 UTC (permalink / raw)
  To: Christopher Lameter
  Cc: David Hildenbrand, akpm, mgorman, mhocko, vbabka, iamjoonsoo.kim,
	guro, linux-kernel, linux-mm, fly

On Fri, 22 Nov 2019 15:53:57 +0000 (UTC)
Christopher Lameter <cl@linux.com> wrote:

> On Fri, 22 Nov 2019, Pengfei Li wrote:
> 
> > I am sorry that I did not make it clear. I want to express that this
> > series of patches will benefit NUMA systems with multiple nodes.
> 
> Ok but that benefit needs to be quantified somehow.
> 

Thanks for your comments.

Yes, I will add detailed performance test data in v2.

> > The main benefit is that it will be more efficient when traversing
> > all nodes (for example when performing page reclamation).
> 
> And you lose the prioritization of allocations through these
> different zones.

Sorry, I forgot to mention that the information about the zones that
are available to the node is still there.

The old for_each_zone_zonelist has been replaced with
for_each_zone_nodelist.

I will add some key implementation details in v2. 
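
To sketch the idea (this is only the rough shape with guessed macro
arguments, not the exact code from the series), a node-first walk can
still honour the allowed zone types by checking each node's zones up to
gfp_zone(gfp_mask):

	int nid;
	int ztype, highidx = gfp_zone(gfp_mask);
	pg_data_t *pgdat;
	struct zone *zone;

	for_each_node_nlist_nodemask(nid, nodelist, nodemask) {
		pgdat = NODE_DATA(nid);
		/* highest allowed zone first, as in the per-node zonelist order */
		for (ztype = highidx; ztype >= 0; ztype--) {
			zone = &pgdat->node_zones[ztype];
			if (!populated_zone(zone))
				continue;
			/* candidate zone on this node */
		}
	}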

> We create zonelists with a certain sequence of the
> zones in order to prefer allocations from certain zones. This is in
> particular important for the smaller DMA zones which may be easily
> exhausted.
> 

I'm not sure if I understand what you mean, but since commit
c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE"), the
zonelist is always in "Node" order, so building the nodelist is fine.

-- 
Pengfei

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 17:36       ` Pengfei Li
@ 2019-11-22 18:24         ` Christopher Lameter
  0 siblings, 0 replies; 39+ messages in thread
From: Christopher Lameter @ 2019-11-22 18:24 UTC (permalink / raw)
  To: Pengfei Li
  Cc: David Hildenbrand, akpm, mgorman, mhocko, vbabka, iamjoonsoo.kim,
	guro, linux-kernel, linux-mm

On Sat, 23 Nov 2019, Pengfei Li wrote:

> I'm not sure if I understand what you mean, but since commit
> c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE"), the
> zonelist is always in "Node" order, so building the nodelist is fine.

Sounds good. Just wonder how the allocations that are constrained to
certain physical addresses (16M via DMA and 4G via DMA32) will work in
that case.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 15:53     ` Qian Cai
@ 2019-11-25  8:39       ` Michal Hocko
  2019-11-26 15:30         ` Qian Cai
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-11-25  8:39 UTC (permalink / raw)
  To: Qian Cai
  Cc: Pengfei Li, lixinhai.lxh, akpm, mgorman, Vlastimil Babka, cl,
	iamjoonsoo.kim, guro, linux-kernel, linux-mm

On Fri 22-11-19 10:53:22, Qian Cai wrote:
[...]
> ZONE_MOVABLE has been broken for ages (non-movable allocations are there all
> the time last time I tried), which indicates that very few people care about
> it, so it is rather weak to use that as a justification for the churn it
> might cause.

People do care about ZONE_MOVABLE and if there is a non-movable memory
sitting there then it is a bug. Please report that.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-22 15:05   ` Pengfei Li
@ 2019-11-25  8:40     ` Michal Hocko
  2019-11-25 14:46       ` Pengfei Li
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-11-25  8:40 UTC (permalink / raw)
  To: Pengfei Li
  Cc: akpm, mgorman, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel, linux-mm

On Fri 22-11-19 23:05:43, Pengfei Li wrote:
> On Thu, 21 Nov 2019 19:04:01 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Thu 21-11-19 23:17:52, Pengfei Li wrote:
> > [...]
> > > Since I don't currently have multiple node NUMA systems, I would be
> > > grateful if anyone would like to test this series of patches.
> > 
> > I didn't really get to think about the actual patchset. From a very
> > quick glance I am wondering whether we need to optimize, as there is
> > usually only a small number of NUMA nodes. But I am quite busy so I
> > cannot really make any claims.
> 
> Thanks for your comments.
> 
> I think it's time to modify the zonelist to nodelist because the
> zonelist is always in node order and the page reclamation is based on
> node.
> 
> I will do more performance testing to show that multi-node systems will
> benefit from this series of patches.

Sensible performance numbers on multiple workloads (ideally some real
world ones rather than artificial microbenchmarks) are essential for a
performance optimization that is this large.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-25  8:40     ` Michal Hocko
@ 2019-11-25 14:46       ` Pengfei Li
  2019-11-25 15:46         ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Pengfei Li @ 2019-11-25 14:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm, mgorman, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel,
	linux-mm, fly

On Mon, 25 Nov 2019 09:40:58 +0100
Michal Hocko <mhocko@kernel.org> wrote:

> On Fri 22-11-19 23:05:43, Pengfei Li wrote:
> > On Thu, 21 Nov 2019 19:04:01 +0100
> > Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > On Thu 21-11-19 23:17:52, Pengfei Li wrote:
> > > [...]
> > > > Since I don't currently have multiple node NUMA systems, I
> > > > would be grateful if anyone would like to test this series of
> > > > patches.
> > > 
> > > I didn't really get to think about the actual patchset. From a
> > > very quick glance I am wondering whether we need to optimize, as
> > > there is usually only a small number of NUMA nodes. But I am quite
> > > busy so I cannot really make any claims.
> > 
> > Thanks for your comments.
> > 
> > I think it's time to modify the zonelist to nodelist because the
> > zonelist is always in node order and the page reclamation is based
> > on node.
> > 
> > I will do more performance testing to show that multi-node systems
> > will benefit from this series of patches.
> 
> Sensible performance numbers on multiple workloads (ideally some real
> world ones rather than artificial microbenchmarks) are essential for a
> performance optimization that is this large.


Thank you for your suggestion.

But this is probably a bit difficult because I don't have a NUMA server
to do real-world workload testing.

I will do as many performance benchmarks as possible, just like Mel
Gorman's "Move LRU page reclaim from zones to nodes v9"
(https://lwn.net/Articles/694121/).


-- 
Pengfei

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-25 14:46       ` Pengfei Li
@ 2019-11-25 15:46         ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2019-11-25 15:46 UTC (permalink / raw)
  To: Pengfei Li
  Cc: akpm, mgorman, vbabka, cl, iamjoonsoo.kim, guro, linux-kernel, linux-mm

On Mon 25-11-19 22:46:03, Pengfei Li wrote:
> On Mon, 25 Nov 2019 09:40:58 +0100
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> > On Fri 22-11-19 23:05:43, Pengfei Li wrote:
> > > On Thu, 21 Nov 2019 19:04:01 +0100
> > > Michal Hocko <mhocko@kernel.org> wrote:
> > > 
> > > > On Thu 21-11-19 23:17:52, Pengfei Li wrote:
> > > > [...]
> > > > > Since I don't currently have multiple node NUMA systems, I
> > > > > would be grateful if anyone would like to test this series of
> > > > > patches.
> > > > 
> > > > I didn't really get to think about the actual patchset. From a
> > > > very quick glance I am wondering whether we need to optimize, as
> > > > there is usually only a small number of NUMA nodes. But I am quite
> > > > busy so I cannot really make any claims.
> > > 
> > > Thanks for your comments.
> > > 
> > > I think it's time to modify the zonelist to nodelist because the
> > > zonelist is always in node order and the page reclamation is based
> > > on node.
> > > 
> > > I will do more performance testing to show that multi-node systems
> > > will benefit from this series of patches.
> > 
> > Sensible performance numbers on multiple workloads (ideally some real
> > world ones rather than artificial microbenchmarks) are essential for a
> > performance optimization that is this large.
> 
> 
> Thank you for your suggestion.
> 
> But this is probably a bit difficult because I don't have a NUMA server
> to do real-world workload testing.

For this particular feature you really do not need any real NUMA server.
Your patch shouldn't introduce NUMA locality. All you are aiming for is
to optimize the zone list iteration.

> I will do as many performance benchmarks as possible, just like Mel
> Gorman's "Move LRU page reclaim from zones to nodes v9"
> (https://lwn.net/Articles/694121/).

Be aware that this will be quite time consuming and non-trivial to
process/evaluate. Not that I want to discourage you from this endeavor,
but it is always good to think about whether your final goal really has
the potential for a visible optimization. I might be wrong, but only the
page allocator should really be the hot path that iterates over the
zonelist, so a microbenchmark targeting this path would be something I
would start with. Unless there are some really nice results from there,
I would not spend more time on other benchmarks.
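
Something as small as the following would already exercise that path (an
illustrative, untested module sketch, not an existing benchmark):

	#include <linux/module.h>
	#include <linux/printk.h>
	#include <linux/gfp.h>
	#include <linux/ktime.h>

	#define NR_ALLOCS 1000000L

	static int __init pgbench_init(void)
	{
		struct page *page;
		ktime_t start, end;
		long i;

		start = ktime_get();
		for (i = 0; i < NR_ALLOCS; i++) {
			/* order-0 GFP_KERNEL allocation walks the zonelist */
			page = alloc_pages(GFP_KERNEL, 0);
			if (!page)
				break;
			__free_pages(page, 0);
		}
		end = ktime_get();

		pr_info("pgbench: %ld alloc/free pairs in %lld ns\n",
			i, ktime_to_ns(ktime_sub(end, start)));
		return 0;
	}

	static void __exit pgbench_exit(void)
	{
	}

	module_init(pgbench_init);
	module_exit(pgbench_exit);
	MODULE_LICENSE("GPL");

Comparing the reported time before and after the series on the same
kernel config would be a reasonable first data point.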
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-25  8:39       ` Michal Hocko
@ 2019-11-26 15:30         ` Qian Cai
  2019-11-26 15:41           ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Qian Cai @ 2019-11-26 15:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Pengfei Li, lixinhai.lxh, akpm, mgorman, Vlastimil Babka, cl,
	iamjoonsoo.kim, guro, linux-kernel, linux-mm



> On Nov 25, 2019, at 3:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> People do care about ZONE_MOVABLE and if there is a non-movable memory
> sitting there then it is a bug. Please report that.

It is trivial to test yourself if you ever care. Just pass kernelcore= to as many NUMA machines as you can find, and then test whether it is ever possible to offline that memory.
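
Roughly (the memory block number below is just an example):

	# boot with e.g. kernelcore=4G (or movablecore=), then check that
	# ZONE_MOVABLE was actually populated on the nodes
	grep Movable /proc/zoneinfo

	# pick a memory block that reports Movable ...
	cat /sys/devices/system/memory/memory42/valid_zones

	# ... and see whether it can really be offlined
	echo offline > /sys/devices/system/memory/memory42/state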

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-26 15:30         ` Qian Cai
@ 2019-11-26 15:41           ` Michal Hocko
  2019-11-26 19:04             ` Qian Cai
  0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-11-26 15:41 UTC (permalink / raw)
  To: Qian Cai
  Cc: Pengfei Li, lixinhai.lxh, akpm, mgorman, Vlastimil Babka, cl,
	iamjoonsoo.kim, guro, linux-kernel, linux-mm

On Tue 26-11-19 10:30:56, Qian Cai wrote:
> 
> 
> > On Nov 25, 2019, at 3:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > People do care about ZONE_MOVABLE and if there is a non-movable memory
> > sitting there then it is a bug. Please report that.
> 
> It is trivial to test yourself if you ever care. Just pass kernelcore=
> to as many NUMA machines as you can find, and then test whether it is
> ever possible to offline that memory.

I definitely do care if you can provide more details (ideally in a
separate email thread). I am using movable memory for memory hotplug
usecases and so far I do not remember any kernel/non-movable allocations
would make it in - modulo bugs when somebody might use __GFP_MOVABLE
when it is not appropriate.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-26 15:41           ` Michal Hocko
@ 2019-11-26 19:04             ` Qian Cai
  2019-11-27  8:50               ` Michal Hocko
  0 siblings, 1 reply; 39+ messages in thread
From: Qian Cai @ 2019-11-26 19:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Pengfei Li, lixinhai.lxh, akpm, mgorman, Vlastimil Babka, cl,
	iamjoonsoo.kim, guro, linux-kernel, linux-mm



> On Nov 26, 2019, at 10:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Tue 26-11-19 10:30:56, Qian Cai wrote:
>> 
>> 
>>> On Nov 25, 2019, at 3:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>> 
>>> People do care about ZONE_MOVABLE and if there is a non-movable memory
>>> sitting there then it is a bug. Please report that.
>> 
>> It is trivial to test yourself if you ever care. Just pass kernelcore=
>> to as many NUMA machines as you can find, and then test whether it is
>> ever possible to offline that memory.
> 
> I definitely do care if you can provide more details (ideally in a
> separate email thread). I am using movable memory for memory hotplug
> usecases and so far I do not remember any kernel/non-movable allocations
> would make it in - modulo bugs when somebody might use __GFP_MOVABLE
> when it is not appropriate.
> 

I don’t think it is anything to do with __GFP_MOVABLE. It is about booting a kernel
with either passing kernelcore= or movablecore=. Then, those ZONE_MOVABLE will
have non-movable pages which look like those from vmemmap_populate(). How
do you create ZONE_MOVABLE in the first place?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC v1 00/19] Modify zonelist to nodelist v1
  2019-11-26 19:04             ` Qian Cai
@ 2019-11-27  8:50               ` Michal Hocko
  0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2019-11-27  8:50 UTC (permalink / raw)
  To: Qian Cai
  Cc: Pengfei Li, lixinhai.lxh, akpm, mgorman, Vlastimil Babka, cl,
	iamjoonsoo.kim, guro, linux-kernel, linux-mm

On Tue 26-11-19 14:04:06, Qian Cai wrote:
> 
> 
> > On Nov 26, 2019, at 10:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > On Tue 26-11-19 10:30:56, Qian Cai wrote:
> >> 
> >> 
> >>> On Nov 25, 2019, at 3:39 AM, Michal Hocko <mhocko@kernel.org> wrote:
> >>> 
> >>> People do care about ZONE_MOVABLE and if there is a non-movable memory
> >>> sitting there then it is a bug. Please report that.
> >> 
> >> It is trivial to test yourself if you ever care. Just pass kernelcore=
> >> to as many NUMA machines as you can find, and then test whether it is
> >> ever possible to offline that memory.
> > 
> > I definitely do care if you can provide more details (ideally in a
> > separate email thread). I am using movable memory for memory hotplug
> > usecases and so far I do not remember any kernel/non-movable allocations
> > would make it in - modulo bugs when somebody might use __GFP_MOVABLE
> > when it is not appropriate.
> > 
> 
> I don’t think it is anything to do with __GFP_MOVABLE. It is about booting a kernel
> with either passing kernelcore= or movablecore=. Then, those ZONE_MOVABLE will
> have non-movable pages which look like those from vmemmap_populate().

OK, I see. This looks like a bug in kernelcore/movablecore. And honestly
I wouldn't be surprised because these are hacks that should have been
removed. I have even attempted to do that because their main usecase is
mostly gone. There were some objections though...

> How do you create ZONE_MOVABLE in the first place?

The most common usecase I work with is the movable_node parameter,
which makes whole nodes that are marked as hotpluggable go to
ZONE_MOVABLE.
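
Roughly (block number just as an example):

	# kernel command line
	movable_node

	# memory of a hotpluggable node can then be onlined as movable
	echo online_movable > /sys/devices/system/memory/memory80/state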

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread


Thread overview: 39+ messages
2019-11-21 15:17 [RFC v1 00/19] Modify zonelist to nodelist v1 Pengfei Li
2019-11-21 15:17 ` [RFC v1 01/19] mm, mmzone: modify zonelist to nodelist Pengfei Li
2019-11-21 15:17 ` [RFC v1 02/19] mm, hugetlb: use for_each_node in dequeue_huge_page_nodemask() Pengfei Li
2019-11-21 15:17 ` [RFC v1 03/19] mm, oom_kill: use for_each_node in constrained_alloc() Pengfei Li
2019-11-21 15:17 ` [RFC v1 04/19] mm, slub: use for_each_node in get_any_partial() Pengfei Li
2019-11-21 15:17 ` [RFC v1 05/19] mm, slab: use for_each_node in fallback_alloc() Pengfei Li
2019-11-21 15:17 ` [RFC v1 06/19] mm, vmscan: use for_each_node in do_try_to_free_pages() Pengfei Li
2019-11-21 15:17 ` [RFC v1 07/19] mm, vmscan: use first_node in throttle_direct_reclaim() Pengfei Li
2019-11-21 15:18 ` [RFC v1 08/19] mm, vmscan: pass pgdat to wakeup_kswapd() Pengfei Li
2019-11-21 15:18 ` [RFC v1 09/19] mm, vmscan: use for_each_node in shrink_zones() Pengfei Li
2019-11-21 15:18 ` [RFC v1 10/19] mm, page_alloc: use for_each_node in wake_all_kswapds() Pengfei Li
2019-11-21 15:18 ` [RFC v1 11/19] mm, mempolicy: use first_node in mempolicy_slab_node() Pengfei Li
2019-11-21 15:18 ` [RFC v1 12/19] mm, mempolicy: use first_node in mpol_misplaced() Pengfei Li
2019-11-21 15:18 ` [RFC v1 13/19] mm, page_alloc: use first_node in local_memory_node() Pengfei Li
2019-11-21 15:18 ` [RFC v1 14/19] mm, compaction: rename compaction_zonelist_suitable Pengfei Li
2019-11-21 15:18 ` [RFC v1 15/19] mm, mm_init: rename mminit_verify_zonelist Pengfei Li
2019-11-21 15:18 ` [RFC v1 16/19] mm, page_alloc: cleanup build_zonelists Pengfei Li
2019-11-21 15:18 ` [RFC v1 17/19] mm, memory_hotplug: cleanup online_pages() Pengfei Li
2019-11-21 15:18 ` [RFC v1 18/19] kernel, sysctl: cleanup numa_zonelist_order Pengfei Li
2019-11-21 15:18 ` [RFC v1 19/19] mm, mmzone: cleanup zonelist in comments Pengfei Li
2019-11-21 18:04 ` [RFC v1 00/19] Modify zonelist to nodelist v1 Michal Hocko
2019-11-22 15:05   ` Pengfei Li
2019-11-25  8:40     ` Michal Hocko
2019-11-25 14:46       ` Pengfei Li
2019-11-25 15:46         ` Michal Hocko
2019-11-22 10:03 ` David Hildenbrand
2019-11-22 15:49   ` Pengfei Li
2019-11-22 15:53     ` Christopher Lameter
2019-11-22 16:06       ` David Hildenbrand
2019-11-22 17:36       ` Pengfei Li
2019-11-22 18:24         ` Christopher Lameter
     [not found] ` <2019112215245905276118@gmail.com>
2019-11-22 10:14   ` David Hildenbrand
2019-11-22 15:28   ` Pengfei Li
2019-11-22 15:53     ` Qian Cai
2019-11-25  8:39       ` Michal Hocko
2019-11-26 15:30         ` Qian Cai
2019-11-26 15:41           ` Michal Hocko
2019-11-26 19:04             ` Qian Cai
2019-11-27  8:50               ` Michal Hocko
