linux-mm.kvack.org archive mirror
* [PATCH 0/9] cleanup zonelists initialization
@ 2017-07-14  7:59 Michal Hocko
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
                   ` (8 more replies)
  0 siblings, 9 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  7:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, joonsoo kim, linux-api, Michal Hocko, Shaohua Li,
	Toshi Kani, Wen Congyang

Hi,
this series is aimed at cleaning up the zonelists initialization code we
have. The primary motivation was bug report [1]; that one got resolved,
but the usage of stop_machine is just too ugly to live. Most patches are
straightforward but 3 of them need a special consideration.

Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
because this is a user visible change. As I argue in the patch
description I do not think we have a strong usecase for it these days.
I have kept the sysctl in place and warn in the log if somebody tries to
configure zonelist ordering. If somebody has a real usecase for it
we can revert this patch but I do not expect anybody will actually notice
runtime differences. This patch is not strictly needed for the rest but
it made patch 6 easier to implement.

Patch 7 removes stop_machine from build_all_zonelists without adding any
special synchronization between iterators and the updater, which I
_believe_ is acceptable as explained in the changelog. I hope I am not
missing anything.

Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
not really needed AFAICS, but care should be taken when double checking
my thinking.

This has passed my light testing but I currently do not have HW to
test the hotadd_new_pgdat path (aka a completely new node added to the
system at runtime).

This is based on the current mmotm git tree (mmotm-2017-07-12-15-11).
Any feedback is highly appreciated.

The diffstat looks really promising:
 include/linux/mmzone.h |   3 +-
 init/main.c            |   2 +-
 kernel/sysctl.c        |   2 -
 mm/internal.h          |   1 +
 mm/memory_hotplug.c    |  27 +----
 mm/page_alloc.c        | 293 ++++++++++++-------------------------------------
 mm/page_ext.c          |   5 +-
 mm/sparse-vmemmap.c    |  11 +-
 mm/sparse.c            |  10 +-
 9 files changed, 89 insertions(+), 265 deletions(-)

Shortlog says
Michal Hocko (9):
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE
      mm, page_alloc: remove boot pageset initialization from memory hotplug
      mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
      mm, memory_hotplug: drop zone from build_all_zonelists
      mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node
      mm, page_alloc: simplify zonelist initialization
      mm, page_alloc: remove stop_machine from build_all_zonelists
      mm, memory_hotplug: get rid of zonelists_mutex
      mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations

[1] http://lkml.kernel.org/r/alpine.DEB.2.20.1706291803380.1861@nanos


* [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
@ 2017-07-14  7:59 ` Michal Hocko
  2017-07-14  9:36   ` Mel Gorman
  2017-07-19  9:33   ` Vlastimil Babka
  2017-07-14  7:59 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  7:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko, linux-api

From: Michal Hocko <mhocko@suse.com>

Supporting zone ordered zonelists costs us just a lot of code while
the usefulness is arguable if existent at all. Mel has already made
node ordering default on 64b systems. 32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fall back to
a different NUMA node rather than consume precious lowmem zones.
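
For illustration, on a hypothetical 32b machine with two nodes (zone
names abbreviated, node id as suffix) the two orderings would produce
fallback zonelists roughly like this:

  node order: HighMem_0, Normal_0, DMA_0, HighMem_1, Normal_1, DMA_1
  zone order: HighMem_0, HighMem_1, Normal_0, Normal_1, DMA_0, DMA_1

In other words, with zone ordering a highmem request spills over to the
remote node before it may touch any precious lowmem zone.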

This argument is, however, weakened by the fact that the memory reclaim
has been reworked to be node rather than zone oriented. This means
that lowmem requests have to skip over all highmem pages on LRUs already
and so zone ordering doesn't save much reclaim time. The only remaining
advantage of zone ordering is under light memory pressure, when highmem
requests never spill into lowmem zones and the lowmem zones therefore
never need to reclaim.

Considering that 32b NUMA systems are rather suboptimal already and
it is generally advisable to use a 64b kernel on such HW, I believe we
should rather care about the code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
somebody tries to set zone ordering either from the kernel command line
or the sysctl.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/sysctl.c |   2 -
 mm/page_alloc.c | 178 ++++++++------------------------------------------------
 2 files changed, 23 insertions(+), 157 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 655686d546cb..0cbce40f5426 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1553,8 +1553,6 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "numa_zonelist_order",
-		.data		= &numa_zonelist_order,
-		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
 		.mode		= 0644,
 		.proc_handler	= numa_zonelist_order_handler,
 	},
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..d9f4ea057e74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4791,52 +4791,18 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	return nr_zones;
 }
 
-
-/*
- *  zonelist_order:
- *  0 = automatic detection of better ordering.
- *  1 = order by ([node] distance, -zonetype)
- *  2 = order by (-zonetype, [node] distance)
- *
- *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
- *  the same zonelist. So only NUMA can configure this param.
- */
-#define ZONELIST_ORDER_DEFAULT  0
-#define ZONELIST_ORDER_NODE     1
-#define ZONELIST_ORDER_ZONE     2
-
-/* zonelist order in the kernel.
- * set_zonelist_order() will set this to NODE or ZONE.
- */
-static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
-static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
-
-
 #ifdef CONFIG_NUMA
-/* The value user specified ....changed by config */
-static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-/* string for sysctl */
-#define NUMA_ZONELIST_ORDER_LEN	16
-char numa_zonelist_order[16] = "default";
-
-/*
- * interface for configure zonelist ordering.
- * command line option "numa_zonelist_order"
- *	= "[dD]efault	- default, automatic configuration.
- *	= "[nN]ode 	- order by node locality, then by zone within node
- *	= "[zZ]one      - order by zone, then by locality within zone
- */
 
 static int __parse_numa_zonelist_order(char *s)
 {
-	if (*s == 'd' || *s == 'D') {
-		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-	} else if (*s == 'n' || *s == 'N') {
-		user_zonelist_order = ZONELIST_ORDER_NODE;
-	} else if (*s == 'z' || *s == 'Z') {
-		user_zonelist_order = ZONELIST_ORDER_ZONE;
-	} else {
-		pr_warn("Ignoring invalid numa_zonelist_order value:  %s\n", s);
+	/*
+	 * We used to support different zonelist modes but they turned
+	 * out to be just not useful. Let's keep the warning in place
+	 * if somebody still uses the cmd line parameter so that we do
+	 * not fail it silently
+	 */
+	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
+		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
 		return -EINVAL;
 	}
 	return 0;
@@ -4844,16 +4810,10 @@ static int __parse_numa_zonelist_order(char *s)
 
 static __init int setup_numa_zonelist_order(char *s)
 {
-	int ret;
-
 	if (!s)
 		return 0;
 
-	ret = __parse_numa_zonelist_order(s);
-	if (ret == 0)
-		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
-
-	return ret;
+	return __parse_numa_zonelist_order(s);
 }
 early_param("numa_zonelist_order", setup_numa_zonelist_order);
 
@@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos)
 {
-	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	char *str;
 	int ret;
-	static DEFINE_MUTEX(zl_order_mutex);
 
-	mutex_lock(&zl_order_mutex);
-	if (write) {
-		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
-			ret = -EINVAL;
-			goto out;
-		}
-		strcpy(saved_string, (char *)table->data);
+	if (!write) {
+		int len = sizeof("Default");
+		if (copy_to_user(buffer, "Default", len))
+			return -EFAULT;
+		return len;
 	}
-	ret = proc_dostring(table, write, buffer, length, ppos);
-	if (ret)
-		goto out;
-	if (write) {
-		int oldval = user_zonelist_order;
 
-		ret = __parse_numa_zonelist_order((char *)table->data);
-		if (ret) {
-			/*
-			 * bogus value.  restore saved string
-			 */
-			strncpy((char *)table->data, saved_string,
-				NUMA_ZONELIST_ORDER_LEN);
-			user_zonelist_order = oldval;
-		} else if (oldval != user_zonelist_order) {
-			mutex_lock(&zonelists_mutex);
-			build_all_zonelists(NULL, NULL);
-			mutex_unlock(&zonelists_mutex);
-		}
-	}
-out:
-	mutex_unlock(&zl_order_mutex);
+	str = memdup_user_nul(buffer, 16);
+	if (IS_ERR(str))
+		return PTR_ERR(str);
+
+	ret = __parse_numa_zonelist_order(str);
+	kfree(str);
 	return ret;
 }
 
@@ -5006,70 +4948,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  */
 static int node_order[MAX_NUMNODES];
 
-static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
-{
-	int pos, j, node;
-	int zone_type;		/* needs to be signed */
-	struct zone *z;
-	struct zonelist *zonelist;
-
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	pos = 0;
-	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
-		for (j = 0; j < nr_nodes; j++) {
-			node = node_order[j];
-			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (managed_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
-				check_highest_zone(zone_type);
-			}
-		}
-	}
-	zonelist->_zonerefs[pos].zone = NULL;
-	zonelist->_zonerefs[pos].zone_idx = 0;
-}
-
-#if defined(CONFIG_64BIT)
-/*
- * Devices that require DMA32/DMA are relatively rare and do not justify a
- * penalty to every machine in case the specialised case applies. Default
- * to Node-ordering on 64-bit NUMA machines
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_NODE;
-}
-#else
-/*
- * On 32-bit, the Normal zone needs to be preserved for allocations accessible
- * by the kernel. If processes running on node 0 deplete the low memory zone
- * then reclaim will occur more frequency increasing stalls and potentially
- * be easier to OOM if a large percentage of the zone is under writeback or
- * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
- * Hence, default to zone ordering on 32-bit.
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_ZONE;
-}
-#endif /* CONFIG_64BIT */
-
-static void set_zonelist_order(void)
-{
-	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
-		current_zonelist_order = default_zonelist_order();
-	else
-		current_zonelist_order = user_zonelist_order;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int i, node, load;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -5099,15 +4983,7 @@ static void build_zonelists(pg_data_t *pgdat)
 
 		prev_node = node;
 		load--;
-		if (order == ZONELIST_ORDER_NODE)
-			build_zonelists_in_node_order(pgdat, node);
-		else
-			node_order[i++] = node;	/* remember order */
-	}
-
-	if (order == ZONELIST_ORDER_ZONE) {
-		/* calculate node order -- i.e., DMA last! */
-		build_zonelists_in_zone_order(pgdat, i);
+		build_zonelists_in_node_order(pgdat, node);
 	}
 
 	build_thisnode_zonelists(pgdat);
@@ -5135,11 +5011,6 @@ static void setup_min_unmapped_ratio(void);
 static void setup_min_slab_ratio(void);
 #else	/* CONFIG_NUMA */
 
-static void set_zonelist_order(void)
-{
-	current_zonelist_order = ZONELIST_ORDER_ZONE;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -5279,8 +5150,6 @@ build_all_zonelists_init(void)
  */
 void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 {
-	set_zonelist_order();
-
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
@@ -5306,9 +5175,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	pr_info("Built %i zonelists in %s order, mobility grouping %s.  Total pages: %ld\n",
+	pr_info("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
 		nr_online_nodes,
-		zonelist_order_name[current_zonelist_order],
 		page_group_by_mobility_disabled ? "off" : "on",
 		vm_total_pages);
 #ifdef CONFIG_NUMA
-- 
2.11.0


* [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
@ 2017-07-14  7:59 ` Michal Hocko
  2017-07-14  9:39   ` Mel Gorman
  2017-07-19 13:15   ` Vlastimil Babka
  2017-07-14  8:00 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  7:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

boot_pageset is a boot time hack which gets superseded by normal
pagesets later in the boot process. It makes zero sense to reinitialize
it again and again during memory hotplug.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d9f4ea057e74..7746824a425d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5098,23 +5098,8 @@ static int __build_all_zonelists(void *data)
 		}
 	}
 
-	/*
-	 * Initialize the boot_pagesets that are going to be used
-	 * for bootstrapping processors. The real pagesets for
-	 * each zone will be allocated later when the per cpu
-	 * allocator is available.
-	 *
-	 * boot_pagesets are used also for bootstrapping offline
-	 * cpus if the system is already booted because the pagesets
-	 * are needed to initialize allocators on a specific cpu too.
-	 * F.e. the percpu allocator needs the page allocator which
-	 * needs the percpu allocator in order to allocate its pagesets
-	 * (a chicken-egg dilemma).
-	 */
-	for_each_possible_cpu(cpu) {
-		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
-
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
+	for_each_possible_cpu(cpu) {
 		/*
 		 * We now know the "local memory node" for each node--
 		 * i.e., the node of the first zone in the generic zonelist.
@@ -5125,8 +5110,8 @@ static int __build_all_zonelists(void *data)
 		 */
 		if (cpu_online(cpu))
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
-#endif
 	}
+#endif
 
 	return 0;
 }
@@ -5134,7 +5119,26 @@ static int __build_all_zonelists(void *data)
 static noinline void __init
 build_all_zonelists_init(void)
 {
+	int cpu;
+
 	__build_all_zonelists(NULL);
+
+	/*
+	 * Initialize the boot_pagesets that are going to be used
+	 * for bootstrapping processors. The real pagesets for
+	 * each zone will be allocated later when the per cpu
+	 * allocator is available.
+	 *
+	 * boot_pagesets are used also for bootstrapping offline
+	 * cpus if the system is already booted because the pagesets
+	 * are needed to initialize allocators on a specific cpu too.
+	 * F.e. the percpu allocator needs the page allocator which
+	 * needs the percpu allocator in order to allocate its pagesets
+	 * (a chicken-egg dilemma).
+	 */
+	for_each_possible_cpu(cpu)
+		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+
 	mminit_verify_zonelist();
 	cpuset_init_current_mems_allowed();
 }
-- 
2.11.0


* [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
  2017-07-14  7:59 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
@ 2017-07-14  8:00 ` Michal Hocko
  2017-07-14  9:48   ` Mel Gorman
  2017-07-19 13:19   ` Vlastimil Babka
  2017-07-14  8:00 ` [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists Michal Hocko
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  8:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__build_all_zonelists reinitializes each online cpu local node for
CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
memoryless nodes could gain some memory during memory hotplug and so the
local node should be changed for CPUs close to such a node. It makes less
sense to do that unconditionally for a newly created NUMA node which is
still offline and without any memory.

Let's also simplify the cpu loop and use for_each_online_cpu instead of
an explicit cpu_online check for all possible cpus.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7746824a425d..ebc3311555b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
 
 			build_zonelists(pgdat);
 		}
-	}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-	for_each_possible_cpu(cpu) {
 		/*
 		 * We now know the "local memory node" for each node--
 		 * i.e., the node of the first zone in the generic zonelist.
@@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
 		 * secondary cpus' numa_mem as they come on-line.  During
 		 * node/memory hotplug, we'll fixup all on-line cpus.
 		 */
-		if (cpu_online(cpu))
+		for_each_online_cpu(cpu)
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
-	}
 #endif
+	}
 
 	return 0;
 }
-- 
2.11.0


* [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
                   ` (2 preceding siblings ...)
  2017-07-14  8:00 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
@ 2017-07-14  8:00 ` Michal Hocko
  2017-07-19 13:33   ` Vlastimil Babka
  2017-07-14  8:00 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  8:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko, Wen Congyang

From: Michal Hocko <mhocko@suse.com>

build_all_zonelists gets a zone parameter to initialize zone's
pagesets. There is only a single user which gives a non-NULL
zone parameter and that one doesn't really need the rest of
build_all_zonelists (see 6dcd73d7011b ("memory-hotplug: allocate zone's
pcp before onlining pages")).

Therefore remove setup_zone_pageset from build_all_zonelists and call it
from its only user directly. This also removes a pointless zonelists
rebuild, which is always good.

Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  2 +-
 init/main.c            |  2 +-
 mm/internal.h          |  1 +
 mm/memory_hotplug.c    | 10 +++++-----
 mm/page_alloc.c        | 12 +++---------
 5 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b3f6ce..97f0f42e9f2a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -771,7 +771,7 @@ static inline bool is_dev_zone(const struct zone *zone)
 #include <linux/memory_hotplug.h>
 
 extern struct mutex zonelists_mutex;
-void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
+void build_all_zonelists(pg_data_t *pgdat);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			 int classzone_idx, unsigned int alloc_flags,
diff --git a/init/main.c b/init/main.c
index f866510472d7..e1baf6544d15 100644
--- a/init/main.c
+++ b/init/main.c
@@ -519,7 +519,7 @@ asmlinkage __visible void __init start_kernel(void)
 	boot_cpu_state_init();
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
 
-	build_all_zonelists(NULL, NULL);
+	build_all_zonelists(NULL);
 	page_alloc_init();
 
 	pr_notice("Kernel command line: %s\n", boot_command_line);
diff --git a/mm/internal.h b/mm/internal.h
index 24d88f084705..19d1f35aa6d7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -522,4 +522,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 	return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
 }
 
+void setup_zone_pageset(struct zone *zone);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d620d0427b6b..639b8af37c45 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -952,7 +952,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	mutex_lock(&zonelists_mutex);
 	if (!populated_zone(zone)) {
 		need_zonelists_rebuild = 1;
-		build_all_zonelists(NULL, zone);
+		setup_zone_pageset(zone);
 	}
 
 	ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
@@ -973,7 +973,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
-			build_all_zonelists(NULL, NULL);
+			build_all_zonelists(NULL);
 		else
 			zone_pcp_update(zone);
 	}
@@ -1051,7 +1051,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	 * to access not-initialized zonelist, build here.
 	 */
 	mutex_lock(&zonelists_mutex);
-	build_all_zonelists(pgdat, NULL);
+	build_all_zonelists(pgdat);
 	mutex_unlock(&zonelists_mutex);
 
 	/*
@@ -1107,7 +1107,7 @@ int try_online_node(int nid)
 
 	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL);
 		mutex_unlock(&zonelists_mutex);
 	}
 
@@ -1727,7 +1727,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL);
 		mutex_unlock(&zonelists_mutex);
 	} else
 		zone_pcp_update(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebc3311555b1..00e117922b3f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5065,7 +5065,6 @@ static void build_zonelists(pg_data_t *pgdat)
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
-static void setup_zone_pageset(struct zone *zone);
 
 /*
  * Global mutex to protect against size modification of zonelists
@@ -5146,19 +5145,14 @@ build_all_zonelists_init(void)
  * unless system_state == SYSTEM_BOOTING.
  *
  * __ref due to (1) call of __meminit annotated setup_zone_pageset
- * [we're only called with non-NULL zone through __meminit paths] and
- * (2) call of __init annotated helper build_all_zonelists_init
+ * and (2) call of __init annotated helper build_all_zonelists_init
  * [protected by SYSTEM_BOOTING].
  */
-void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
+void __ref build_all_zonelists(pg_data_t *pgdat)
 {
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
-#ifdef CONFIG_MEMORY_HOTPLUG
-		if (zone)
-			setup_zone_pageset(zone);
-#endif
 		/* we have to stop all cpus to guarantee there is no user
 		   of zonelist */
 		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
@@ -5432,7 +5426,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu)
 	pageset_set_high_and_batch(zone, pcp);
 }
 
-static void __meminit setup_zone_pageset(struct zone *zone)
+void __meminit setup_zone_pageset(struct zone *zone)
 {
 	int cpu;
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
-- 
2.11.0


* [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
                   ` (3 preceding siblings ...)
  2017-07-14  8:00 ` [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists Michal Hocko
@ 2017-07-14  8:00 ` Michal Hocko
  2017-07-14 12:14   ` Michal Hocko
  2017-07-20  6:13   ` Vlastimil Babka
  2017-07-14  8:00 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  8:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko, Toshi Kani

From: Michal Hocko <mhocko@suse.com>

try_online_node calls hotadd_new_pgdat which already calls
build_all_zonelists. So the additional call is redundant. Even though
hotadd_new_pgdat will only initialize zonelists of the new node, this is
the right thing to do because such a node doesn't have any memory so
other zonelists would ignore all the zones from this node anyway.

Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 639b8af37c45..0d2f6a11075c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1104,13 +1104,6 @@ int try_online_node(int nid)
 	node_set_online(nid);
 	ret = register_one_node(nid);
 	BUG_ON(ret);
-
-	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
-		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL);
-		mutex_unlock(&zonelists_mutex);
-	}
-
 out:
 	mem_hotplug_done();
 	return ret;
-- 
2.11.0


* [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
                   ` (4 preceding siblings ...)
  2017-07-14  8:00 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
@ 2017-07-14  8:00 ` Michal Hocko
  2017-07-14  9:55   ` Mel Gorman
                     ` (2 more replies)
  2017-07-14  8:00 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
                   ` (2 subsequent siblings)
  8 siblings, 3 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  8:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have in
each node we rely on the _zonerefs array to terminate the initialized
part of the zonelist with a NULL zone. While this is functionally correct
it is quite suboptimal because we cannot allow updaters to race with
zonelist users: they could see an empty zonelist and fail the allocation
or, in the worst case, hit the OOM killer.

We can do much better, though. We can store the node ordering into the
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
zonelist consumers might still see a halfway initialized state but that
should be much more tolerable because the list will never be empty; in
the worst case they would see some zone twice or skip over some zone(s)
(e.g. while a rebuild swaps two entries, a racing reader can transiently
see one of them twice), which shouldn't lead to immediate failures.

This patch alone doesn't introduce any functional change yet; it is
merely preparatory work for later changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 42 ++++++++++++++++++------------------------
 1 file changed, 18 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 00e117922b3f..78bd62418380 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
 {
-	int j;
 	struct zonelist *zonelist;
+	int i, zoneref_idx = 0;
 
 	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
-		;
-	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		pg_data_t *node = NODE_DATA(node_order[i]);
+
+		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
+	}
+	zonelist->_zonerefs[zoneref_idx].zone = NULL;
+	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
 }
 
 /*
@@ -4931,13 +4934,13 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-	int j;
 	struct zonelist *zonelist;
+	int zoneref_idx = 0;
 
 	zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK];
-	j = build_zonelists_node(pgdat, zonelist, 0);
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+	zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
+	zonelist->_zonerefs[zoneref_idx].zone = NULL;
+	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
 }
 
 /*
@@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  * exhausted, but results in overflowing to remote node while memory
  * may still exist in local DMA zone.
  */
-static int node_order[MAX_NUMNODES];
 
 static void build_zonelists(pg_data_t *pgdat)
 {
-	int i, node, load;
+	static int node_order[MAX_NUMNODES];
+	int node, load, i = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
-	struct zonelist *zonelist;
-
-	/* initialize zonelists */
-	for (i = 0; i < MAX_ZONELISTS; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->_zonerefs[0].zone = NULL;
-		zonelist->_zonerefs[0].zone_idx = 0;
-	}
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -4969,8 +4964,6 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	i = 0;
-
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
 		/*
 		 * We don't want to pressure a particular node.
@@ -4981,11 +4974,12 @@ static void build_zonelists(pg_data_t *pgdat)
 		    node_distance(local_node, prev_node))
 			node_load[node] = load;
 
+		node_order[i++] = node;
 		prev_node = node;
 		load--;
-		build_zonelists_in_node_order(pgdat, node);
 	}
 
+	build_zonelists_in_node_order(pgdat, node_order);
 	build_thisnode_zonelists(pgdat);
 }
 
-- 
2.11.0


* [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
                   ` (5 preceding siblings ...)
  2017-07-14  8:00 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
@ 2017-07-14  8:00 ` Michal Hocko
  2017-07-14  9:59   ` Mel Gorman
                     ` (2 more replies)
  2017-07-14  8:00 ` [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex Michal Hocko
  2017-07-14  8:00 ` [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Michal Hocko
  8 siblings, 3 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  8:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

build_all_zonelists has been (ab)using stop_machine to make sure that
zonelists do not change while somebody is looking at them. This is
just a gross hack because a) it complicates the context from which
we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
switch locking to a percpu rwsem")) and b) it is not really necessary
especially after "mm, page_alloc: simplify zonelist initialization".

Updates of the zonelists happen very seldom, basically only when a zone
becomes populated during memory online or when it loses all the memory
during offline. A racing iteration over zonelists could either miss a
zone or try to work on one zone twice. Both of these are something we
can live with occasionally because there will always be at least one
zone visible so we are not likely to fail allocation too easily for
example.
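
For illustration, a typical zonelist reader looks something like the
following sketch (this is not code from this series and
first_usable_zone is a made-up name):

	static struct zone *first_usable_zone(struct zonelist *zonelist,
					      enum zone_type highidx)
	{
		struct zoneref *z;
		struct zone *zone;

		/*
		 * A racing rebuild may make us see a zone twice or skip
		 * one, but the zonelist is never seen empty or
		 * unterminated.
		 */
		for_each_zone_zonelist(zone, z, zonelist, highidx)
			if (managed_zone(zone))
				return zone;
		return NULL;
	}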

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78bd62418380..217889ecd13f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
  */
 DEFINE_MUTEX(zonelists_mutex);
 
-/* return values int ....just for stop_machine() */
-static int __build_all_zonelists(void *data)
+static void __build_all_zonelists(void *data)
 {
 	int nid;
 	int cpu;
@@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
 #endif
 	}
-
-	return 0;
 }
 
 static noinline void __init
@@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
-		/* we have to stop all cpus to guarantee there is no user
-		   of zonelist */
-		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
+		__build_all_zonelists(pgdat);
 		/* cpuset refresh routine should be here */
 	}
 	vm_total_pages = nr_free_pagecache_pages();
-- 
2.11.0


* [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
                   ` (6 preceding siblings ...)
  2017-07-14  8:00 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
@ 2017-07-14  8:00 ` Michal Hocko
  2017-07-14  8:00 ` [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Michal Hocko
  8 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  8:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

zonelists_mutex was introduced by 4eaf3f64397c ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races. This is no longer needed though
because both memory online and offline are fully serialized. New users
of the mutex have grown since then, however.

Notably setup_per_zone_wmarks wants to prevent races between memory
hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
(see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes")). Let's
add a private lock for that purpose. This will not prevent us from
seeing a halfway-done memory hotplug operation but that shouldn't be a
big deal because memory hotplug will update watermarks explicitly so we
will eventually get the full picture. The lock just makes sure that
concurrent watermark updates do not race and produce weird results.

Also __build_all_zonelists manipulates global data so add a private lock
for it as well. This doesn't seem to be necessary today but it is more
robust to have a lock there.

While we are at it make sure we document that memory online/offline
depends on a full serialization either via mem_hotplug_begin() or
device_lock.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  1 -
 mm/memory_hotplug.c    | 12 ++----------
 mm/page_alloc.c        | 18 +++++++++---------
 3 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 97f0f42e9f2a..3b638f989b7b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -770,7 +770,6 @@ static inline bool is_dev_zone(const struct zone *zone)
 
 #include <linux/memory_hotplug.h>
 
-extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d2f6a11075c..c1b0077eb9d1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -917,7 +917,7 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
 	return zone;
 }
 
-/* Must be protected by mem_hotplug_begin() */
+/* Must be protected by mem_hotplug_begin() or a device_lock */
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
 {
 	unsigned long flags;
@@ -949,7 +949,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	 * This means the page allocator ignores this zone.
 	 * So, zonelist must be updated after online.
 	 */
-	mutex_lock(&zonelists_mutex);
 	if (!populated_zone(zone)) {
 		need_zonelists_rebuild = 1;
 		setup_zone_pageset(zone);
@@ -960,7 +959,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	if (ret) {
 		if (need_zonelists_rebuild)
 			zone_pcp_reset(zone);
-		mutex_unlock(&zonelists_mutex);
 		goto failed_addition;
 	}
 
@@ -978,8 +976,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 			zone_pcp_update(zone);
 	}
 
-	mutex_unlock(&zonelists_mutex);
-
 	init_per_zone_wmark_min();
 
 	if (onlined_pages) {
@@ -1050,9 +1046,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	 * The node we allocated has no zone fallback lists. For avoiding
 	 * to access not-initialized zonelist, build here.
 	 */
-	mutex_lock(&zonelists_mutex);
 	build_all_zonelists(pgdat);
-	mutex_unlock(&zonelists_mutex);
 
 	/*
 	 * zone->managed_pages is set to an approximate value in
@@ -1719,9 +1713,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
-		mutex_lock(&zonelists_mutex);
 		build_all_zonelists(NULL);
-		mutex_unlock(&zonelists_mutex);
 	} else
 		zone_pcp_update(zone);
 
@@ -1747,7 +1739,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	return ret;
 }
 
-/* Must be protected by mem_hotplug_begin() */
+/* Must be protected by mem_hotplug_begin() or a device_lock */
 int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 217889ecd13f..dd4c96edcec3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5060,17 +5060,14 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
-/*
- * Global mutex to protect against size modification of zonelists
- * as well as to serialize pageset setup for the new populated zone.
- */
-DEFINE_MUTEX(zonelists_mutex);
-
 static void __build_all_zonelists(void *data)
 {
 	int nid;
 	int cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
+
+	spin_lock(&lock);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5102,6 +5099,8 @@ static void __build_all_zonelists(void *data)
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
 #endif
 	}
+
+	spin_unlock(&lock);
 }
 
 static noinline void __init
@@ -5132,7 +5131,6 @@ build_all_zonelists_init(void)
 }
 
 /*
- * Called with zonelists_mutex held always
  * unless system_state == SYSTEM_BOOTING.
  *
  * __ref due to (1) call of __meminit annotated setup_zone_pageset
@@ -6874,9 +6872,11 @@ static void __setup_per_zone_wmarks(void)
  */
 void setup_per_zone_wmarks(void)
 {
-	mutex_lock(&zonelists_mutex);
+	static DEFINE_SPINLOCK(lock);
+
+	spin_lock(&lock);
 	__setup_per_zone_wmarks();
-	mutex_unlock(&zonelists_mutex);
+	spin_unlock(&lock);
 }
 
 /*
-- 
2.11.0


* [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
                   ` (7 preceding siblings ...)
  2017-07-14  8:00 ` [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex Michal Hocko
@ 2017-07-14  8:00 ` Michal Hocko
  2017-07-20  8:04   ` Vlastimil Babka
  8 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14  8:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko, joonsoo kim, Shaohua Li

From: Michal Hocko <mhocko@suse.com>

f52407ce2dea ("memory hotplug: alloc page from other node in memory
online") has introduced N_HIGH_MEMORY checks to only use NUMA aware
allocations when there is some memory present because the respective
node might not have any memory yet at the time and so it could fail
or even OOM. Things have changed since then though. Zonelists are
now always initialized before we do any allocations even for hotplug
(see 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
zonelist")). Therefore these checks are not really needed. In fact the
caller of the allocator should never care about whether the node is
populated because that might change at any time.

Cc: Shaohua Li <shaohua.li@intel.com>
Cc: joonsoo kim <js1304@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_ext.c       |  5 +----
 mm/sparse-vmemmap.c | 11 +++--------
 mm/sparse.c         | 10 +++-------
 3 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 88ccc044b09a..714ce79256c5 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -222,10 +222,7 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
 		return addr;
 	}
 
-	if (node_state(nid, N_HIGH_MEMORY))
-		addr = vzalloc_node(size, nid);
-	else
-		addr = vzalloc(size);
+	addr = vzalloc_node(size, nid);
 
 	return addr;
 }
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c50b1a14d55e..d1a39b8051e0 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -54,14 +54,9 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		if (node_state(node, N_HIGH_MEMORY))
-			page = alloc_pages_node(
-				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-				get_order(size));
-		else
-			page = alloc_pages(
-				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-				get_order(size));
+		page = alloc_pages_node(node,
+			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
+			get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
diff --git a/mm/sparse.c b/mm/sparse.c
index 7b4be3fd5cac..a9783acf2bb9 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -65,14 +65,10 @@ static noinline struct mem_section __ref *sparse_index_alloc(int nid)
 	unsigned long array_size = SECTIONS_PER_ROOT *
 				   sizeof(struct mem_section);
 
-	if (slab_is_available()) {
-		if (node_state(nid, N_HIGH_MEMORY))
-			section = kzalloc_node(array_size, GFP_KERNEL, nid);
-		else
-			section = kzalloc(array_size, GFP_KERNEL);
-	} else {
+	if (slab_is_available())
+		section = kzalloc_node(array_size, GFP_KERNEL, nid);
+	else
 		section = memblock_virt_alloc_node(array_size, nid);
-	}
 
 	return section;
 }
-- 
2.11.0


* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
@ 2017-07-14  9:36   ` Mel Gorman
  2017-07-14 10:47     ` Michal Hocko
  2017-07-19  9:33   ` Vlastimil Babka
  1 sibling, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	Michal Hocko, linux-api

On Fri, Jul 14, 2017 at 09:59:58AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Supporting zone ordered zonelists costs us just a lot of code while
> the usefulness is arguable if existent at all. Mel has already made
> node ordering default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fall back to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save much reclaim time. The only remaining
> advantage of zone ordering is under light memory pressure, when highmem
> requests never spill into lowmem zones and the lowmem zones therefore
> never need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use a 64b kernel on such HW, I believe we
> should rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> somebody tries to set zone ordering either from the kernel command line
> or the sysctl.
> 
> Cc: <linux-api@vger.kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> index 80e4adb4c360..d9f4ea057e74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *length,
>  		loff_t *ppos)
>  {
> -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	char *str;
>  	int ret;
> -	static DEFINE_MUTEX(zl_order_mutex);
>  
> -	mutex_lock(&zl_order_mutex);
> -	if (write) {
> -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> -			ret = -EINVAL;
> -			goto out;
> -		}
> -		strcpy(saved_string, (char *)table->data);
> +	if (!write) {
> +		int len = sizeof("Default");
> +		if (copy_to_user(buffer, "Default", len))
> +			return -EFAULT;
> +		return len;
>  	}

That should be "default" because the original code would have the proc
entry display "default" unless it was set at runtime. Pretty weird, I
know, but it's always possible someone is parsing the original default
and not handling it properly.
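
Something like this should do it while keeping the old spelling
(untested):

	if (!write) {
		int len = sizeof("default");

		if (copy_to_user(buffer, "default", len))
			return -EFAULT;
		return len;
	}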

Otherwise I think we're way past the point where large memory 32-bit
NUMA machines are a thing so

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug
  2017-07-14  7:59 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
@ 2017-07-14  9:39   ` Mel Gorman
  2017-07-19 13:15   ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2017-07-14  9:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	Michal Hocko

On Fri, Jul 14, 2017 at 09:59:59AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> boot_pageset is a boot time hack which gets superseded by normal
> pagesets later in the boot process. It makes zero sense to reinitialize
> it again and again during memory hotplug.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

You could also go slightly further and remove the batch parameter to
setup_pageset because it's always 0.
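
A minimal sketch of what that could look like (untested):

	static void setup_pageset(struct per_cpu_pageset *p)
	{
		pageset_init(p);
		pageset_set_batch(p, 0);
	}

Otherwise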

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-14  8:00 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
@ 2017-07-14  9:48   ` Mel Gorman
  2017-07-14 10:50     ` Michal Hocko
  2017-07-19 13:19   ` Vlastimil Babka
  1 sibling, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14  9:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	Michal Hocko

On Fri, Jul 14, 2017 at 10:00:00AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __build_all_zonelists reinitializes each online cpu local node for
> CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
> memoryless nodes could gain some memory during memory hotplug and so the
> local node should be changed for CPUs close to such a node. It makes less
> sense to do that unconditionally for a newly created NUMA node which is
> still offline and without any memory.
> 
> Let's also simplify the cpu loop and use for_each_online_cpu instead of
> an explicit cpu_online check for all possible cpus.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7746824a425d..ebc3311555b1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
>  
>  			build_zonelists(pgdat);
>  		}
> -	}
>  
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> -	for_each_possible_cpu(cpu) {
>  		/*
>  		 * We now know the "local memory node" for each node--
>  		 * i.e., the node of the first zone in the generic zonelist.
> @@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
>  		 * secondary cpus' numa_mem as they come on-line.  During
>  		 * node/memory hotplug, we'll fixup all on-line cpus.
>  		 */
> -		if (cpu_online(cpu))
> +		for_each_online_cpu(cpu)
>  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> -	}
>  #endif
> +	}
>  

This is not as clear a benefit. For each online node, we now go through
all online CPUs once per node. There would be some rationale for using
for_each_online_cpu.
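
Roughly what I mean (untested), keeping the fixup out of the per-node
loop:

	int nid, cpu;

	for_each_online_node(nid)
		build_zonelists(NODE_DATA(nid));

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
	for_each_online_cpu(cpu)
		set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
#endif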

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14  8:00 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
@ 2017-07-14  9:55   ` Mel Gorman
  2017-07-14 10:51     ` Michal Hocko
  2017-07-14 12:46   ` Mel Gorman
  2017-07-20  6:55   ` Vlastimil Babka
  2 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14  9:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	Michal Hocko

On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
>  
>  	zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK];
> -	j = build_zonelists_node(pgdat, zonelist, 0);
> -	zonelist->_zonerefs[j].zone = NULL;
> -	zonelist->_zonerefs[j].zone_idx = 0;
> +	zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
> +	zonelist->_zonerefs[zoneref_idx].zone = NULL;
> +	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
>  }
>  
>  /*
> @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
>   * exhausted, but results in overflowing to remote node while memory
>   * may still exist in local DMA zone.
>   */
> -static int node_order[MAX_NUMNODES];
>  
>  static void build_zonelists(pg_data_t *pgdat)
>  {
> -	int i, node, load;
> +	static int node_order[MAX_NUMNODES];
> +	int node, load, i = 0;

Emm, node_order can be large. The first distro config I checked
indicated that this is 8K. I got hung up on that part and didn't look
closely at the rest of the patch.
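
(8K presumably corresponds to a config with CONFIG_NODES_SHIFT=11, i.e.
MAX_NUMNODES = 2048 and 2048 * sizeof(int) = 8K; even a more common
NODES_SHIFT=10 config still means a 4K array.)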

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14  8:00 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
@ 2017-07-14  9:59   ` Mel Gorman
  2017-07-14 11:00     ` Michal Hocko
  2017-07-14 11:29   ` Vlastimil Babka
  2017-07-20  7:24   ` Vlastimil Babka
  2 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14  9:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	Michal Hocko

On Fri, Jul 14, 2017 at 10:00:04AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_all_zonelists has been (ab)using stop_machine to make sure that
> zonelists do not change while somebody is looking at them. This is
> just a gross hack because a) it complicates the context from which
> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> switch locking to a percpu rwsem")) and b) it is not really necessary
> especially after "mm, page_alloc: simplify zonelist initialization".
> 
> Updates of the zonelists happen very seldom, basically only when a zone
> becomes populated during memory online or when it loses all the memory
> during offline. A racing iteration over zonelists could either miss a
> zone or try to work on one zone twice. Both of these are something we
> can live with occasionally because there will always be at least one
> zone visible so we are not likely to fail allocation too easily for
> example.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

This patch is contingent on the last patch which updates in place
instead of zeroing the early part of the zonelist first but needs to fix
the stack usage issues. I think it's also worth pointing out in the
changelog that stop_machine never gave the guarantees it claimed as a
process iterating through the zonelist can be stopped so when it resumes
the zonelist has changed underneath it. Doing it online is roughly
equivalent in terms of safety.
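
To make the race concrete, the reader side is roughly the following
(simplified from the real zonelist iterator, with a hypothetical helper,
not exact kernel code):

	struct zoneref *z;

	/*
	 * The walk just follows _zonerefs until the NULL-zone terminator;
	 * nothing synchronizes against an updater, so a task stopped
	 * mid-walk (with or without stop_machine on the update side) can
	 * resume and see a zone twice or skip one.
	 */
	for (z = zonelist->_zonerefs; z->zone; z++)
		try_to_allocate_from(z->zone);	/* hypothetical helper */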

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  9:36   ` Mel Gorman
@ 2017-07-14 10:47     ` Michal Hocko
  2017-07-14 11:16       ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 10:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 10:36:50, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 09:59:58AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Supporting zone ordered zonelists costs us just a lot of code while
> > the usefulness is arguable if existent at all. Mel has already made
> > node ordering default on 64b systems. 32b systems are still using
> > ZONELIST_ORDER_ZONE because it is considered better to fallback to
> > a different NUMA node rather than consume precious lowmem zones.
> > 
> > This argument is, however, weakened by the fact that the memory reclaim
> > has been reworked to be node rather than zone oriented. This means
> > that lowmem requests have to skip over all highmem pages on LRUs already
> > and so zone ordering doesn't save the reclaim time much. So the only
> > advantage of the zone ordering is under light memory pressure when
> > highmem requests do not ever hit lowmem zones and the lowmem
> > pressure doesn't need to reclaim.
> > 
> > Considering that 32b NUMA systems are rather suboptimal already and
> > it is generally advisable to use a 64b kernel on such HW I believe we
> > should rather care about the code maintainability and just get rid of
> > ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> > somebody tries to set zone ordering either from the kernel command line
> > or the sysctl.
> > 
> > Cc: <linux-api@vger.kernel.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > index 80e4adb4c360..d9f4ea057e74 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
> >  		void __user *buffer, size_t *length,
> >  		loff_t *ppos)
> >  {
> > -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> > +	char *str;
> >  	int ret;
> > -	static DEFINE_MUTEX(zl_order_mutex);
> >  
> > -	mutex_lock(&zl_order_mutex);
> > -	if (write) {
> > -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> > -			ret = -EINVAL;
> > -			goto out;
> > -		}
> > -		strcpy(saved_string, (char *)table->data);
> > +	if (!write) {
> > +		int len = sizeof("Default");
> > +		if (copy_to_user(buffer, "Default", len))
> > +			return -EFAULT;
> > +		return len;
> >  	}
> 
> That should be "default" because the original code would have the proc
> entry display "default" unless it was set at runtime. Pretty weird I
> know but it's always possible someone is parsing the original default
> and not handling it properly.

Ohh, right! That is indeed strange. Then I guess it would probably be
better to simply return Node to make it clear what the default is. What
do you think?

> Otherwise I think we're way past the point where large memory 32-bit
> NUMA machines are a thing so
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-14  9:48   ` Mel Gorman
@ 2017-07-14 10:50     ` Michal Hocko
  2017-07-14 12:32       ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 10:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri 14-07-17 10:48:10, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:00AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > __build_all_zonelists reinitializes each online cpu local node for
> > CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
> > memoryless nodes could gain some memory during memory hotplug and so the
> > local node should be changed for CPUs close to such a node. It makes less
> > sense to do that unconditionally for a newly created NUMA node which is
> > still offline and without any memory.
> > 
> > Let's also simplify the cpu loop and use for_each_online_cpu instead of
> > an explicit cpu_online check for all possible cpus.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/page_alloc.c | 6 ++----
> >  1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7746824a425d..ebc3311555b1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
> >  
> >  			build_zonelists(pgdat);
> >  		}
> > -	}
> >  
> >  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> > -	for_each_possible_cpu(cpu) {
> >  		/*
> >  		 * We now know the "local memory node" for each node--
> >  		 * i.e., the node of the first zone in the generic zonelist.
> > @@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
> >  		 * secondary cpus' numa_mem as they come on-line.  During
> >  		 * node/memory hotplug, we'll fixup all on-line cpus.
> >  		 */
> > -		if (cpu_online(cpu))
> > +		for_each_online_cpu(cpu)
> >  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> > -	}
> >  #endif
> > +	}
> >  
> 
> This is not as clear a benefit. For each online node, we now go through
> all online CPUs once per node. There would be some rationale for using
> for_each_online_cpu.

I am not sure I understand. I am using for_each_online_cpu...
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14  9:55   ` Mel Gorman
@ 2017-07-14 10:51     ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 10:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri 14-07-17 10:55:34, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> >  
> >  	zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK];
> > -	j = build_zonelists_node(pgdat, zonelist, 0);
> > -	zonelist->_zonerefs[j].zone = NULL;
> > -	zonelist->_zonerefs[j].zone_idx = 0;
> > +	zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
> > +	zonelist->_zonerefs[zoneref_idx].zone = NULL;
> > +	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
> >  }
> >  
> >  /*
> > @@ -4946,21 +4949,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
> >   * exhausted, but results in overflowing to remote node while memory
> >   * may still exist in local DMA zone.
> >   */
> > -static int node_order[MAX_NUMNODES];
> >  
> >  static void build_zonelists(pg_data_t *pgdat)
> >  {
> > -	int i, node, load;
> > +	static int node_order[MAX_NUMNODES];
> > +	int node, load, i = 0;
> 
> Emm, node_order can be large. The first distro config I checked
> indicated that this is 8K. I got hung up on that part and didn't look
> closely at the rest of the patch.

yes, that's why I kept it static. I just placed it into the function to
make it clear what the scope is.
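
As a standalone illustration of the distinction (plain userspace C with a
made-up size, nothing kernel-specific): a function-local static has static
storage duration, so it stays off the stack and only its visibility changes.

	#include <stdio.h>

	static void build_order(void)
	{
		/*
		 * static storage duration: the array sits in .bss, not on
		 * the stack; moving the definition into the function only
		 * narrows the visibility of the name to build_order().
		 */
		static int node_order[2048];

		node_order[0]++;
		printf("%d\n", node_order[0]);
	}

	int main(void)
	{
		build_order();	/* prints 1 */
		build_order();	/* prints 2: one instance persists across calls */
		return 0;
	}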
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14  9:59   ` Mel Gorman
@ 2017-07-14 11:00     ` Michal Hocko
  2017-07-14 12:47       ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 11:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri 14-07-17 10:59:32, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:04AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > build_all_zonelists has been (ab)using stop_machine to make sure that
> > zonelists do not change while somebody is looking at them. This is
> > just a gross hack because a) it complicates the context from which
> > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > switch locking to a percpu rwsem")) and b) it is not really necessary
> > especially after "mm, page_alloc: simplify zonelist initialization".
> > 
> > Updates of the zonelists happen very seldom, basically only when a zone
> > becomes populated during memory online or when it loses all the memory
> > during offline. A racing iteration over zonelists could either miss a
> > zone or try to work on one zone twice. Both of these are something we
> > can live with occasionally because there will always be at least one
> > zone visible so we are not likely to fail allocation too easily for
> > example.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> This patch is contingent on the last patch which updates in place
> instead of zeroing the early part of the zonelist first but needs to fix
> the stack usage issues. I think it's also worth pointing out in the
> changelog that stop_machine never gave the guarantees it claimed as a
> process iterating through the zonelist can be stopped so when it resumes
> the zonelist has changed underneath it. Doing it online is roughly
> equivalent in terms of safety.

OK, what about the following addendum?
"
Please note that the original stop_machine approach doesn't really
provide a better exclusion because the iteration might be interrupted
half way (unless the whole iteration is preempt disabled which is not the
case in most cases) so some zones could still be seen twice or a
zone missed.
"
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 10:47     ` Michal Hocko
@ 2017-07-14 11:16       ` Mel Gorman
  2017-07-14 11:38         ` Michal Hocko
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 11:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > > That should be "default" because the original code would have the proc
> > entry display "default" unless it was set at runtime. Pretty weird I
> > know but it's always possible someone is parsing the original default
> > and not handling it properly.
> 
> Ohh, right! That is indeed strange. Then I guess it would probably be
> better to simply return Node to make it clear what the default is. What
> do you think?
> 

That would work too. The casing still matches.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14  8:00 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
  2017-07-14  9:59   ` Mel Gorman
@ 2017-07-14 11:29   ` Vlastimil Babka
  2017-07-14 11:43     ` Michal Hocko
  2017-07-20  7:24   ` Vlastimil Babka
  2 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-14 11:29 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko

On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_all_zonelists has been (ab)using stop_machine to make sure that
> zonelists do not change while somebody is looking at them. This is
> just a gross hack because a) it complicates the context from which
> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> switch locking to a percpu rwsem")) and b) it is not really necessary
> especially after "mm, page_alloc: simplify zonelist initialization".
> 
> Updates of the zonelists happen very seldom, basically only when a zone
> becomes populated during memory online or when it loses all the memory
> during offline. A racing iteration over zonelists could either miss a
> zone or try to work on one zone twice. Both of these are something we
> can live with occasionally because there will always be at least one
> zone visible so we are not likely to fail allocation too easily for
> example.

Given the experience with cpusets and mempolicies, I would rather
avoid the risk of allocation not seeing the only zone(s) that are
allowed by its nodemask, and triggering premature OOM. So maybe the
updates could be done in a way to avoid that, e.g. first append a copy
of the old zonelist to the end, then overwrite and terminate with NULL.
But if this requires any barriers or something similar on the iteration
site, which is performance critical, then it's bad.
Maybe a seqcount, that the iteration side only starts checking in the
slowpath? Like we have with cpusets now.
I know that Mel noted that stop_machine() also never had such guarantees
to prevent this, but it could have made the chances smaller.
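
A minimal sketch of the seqcount variant (hypothetical names, untested,
just to illustrate the suggestion): the updater bumps the counter around
the rebuild and allocations recheck it only once the fastpath has failed.

	static seqcount_t zonelist_update_seq = SEQCNT_ZERO(zonelist_update_seq);

	static void rebuild_zonelists(pg_data_t *pgdat)
	{
		write_seqcount_begin(&zonelist_update_seq);
		__build_all_zonelists(pgdat);
		write_seqcount_end(&zonelist_update_seq);
	}

	/* slowpath only: retry the zonelist walk if a rebuild raced with it */
	static struct page *slowpath_alloc_sketch(gfp_t gfp, unsigned int order,
						  struct alloc_context *ac)
	{
		struct page *page;
		unsigned int seq;

		do {
			seq = read_seqcount_begin(&zonelist_update_seq);
			page = get_page_from_freelist(gfp, order,
						      ALLOC_WMARK_MIN, ac);
		} while (!page && read_seqcount_retry(&zonelist_update_seq, seq));

		return page;
	}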

> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78bd62418380..217889ecd13f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>   */
>  DEFINE_MUTEX(zonelists_mutex);
>  
> -/* return values int ....just for stop_machine() */
> -static int __build_all_zonelists(void *data)
> +static void __build_all_zonelists(void *data)
>  {
>  	int nid;
>  	int cpu;
> @@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
>  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
>  #endif
>  	}
> -
> -	return 0;
>  }
>  
>  static noinline void __init
> @@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>  	if (system_state == SYSTEM_BOOTING) {
>  		build_all_zonelists_init();
>  	} else {
> -		/* we have to stop all cpus to guarantee there is no user
> -		   of zonelist */
> -		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> +		__build_all_zonelists(pgdat);
>  		/* cpuset refresh routine should be here */
>  	}
>  	vm_total_pages = nr_free_pagecache_pages();
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 11:16       ` Mel Gorman
@ 2017-07-14 11:38         ` Michal Hocko
  2017-07-14 12:56           ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 11:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 12:16:33, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > > That should be "default" because the original code would have the proc
> > > entry display "default" unless it was set at runtime. Pretty weird I
> > > know but it's always possible someone is parsing the original default
> > > and not handling it properly.
> > 
> > Ohh, right! That is indeed strange. Then I guess it would probably be
> > better to simply return Node to make it clear what the default is. What
> > do you think?
> > 
> 
> That would work too. The casing still matches.

This folded in?
---

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14 11:29   ` Vlastimil Babka
@ 2017-07-14 11:43     ` Michal Hocko
  2017-07-14 11:45       ` Michal Hocko
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 11:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML

On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > build_all_zonelists has been (ab)using stop_machine to make sure that
> > zonelists do not change while somebody is looking at them. This is
> > just a gross hack because a) it complicates the context from which
> > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > switch locking to a percpu rwsem")) and b) it is not really necessary
> > especially after "mm, page_alloc: simplify zonelist initialization".
> > 
> > Updates of the zonelists happen very seldom, basically only when a zone
> > becomes populated during memory online or when it loses all the memory
> > during offline. A racing iteration over zonelists could either miss a
> > zone or try to work on one zone twice. Both of these are something we
> > can live with occasionally because there will always be at least one
> > zone visible so we are not likely to fail allocation too easily for
> > example.
> 
> Given the experience with cpusets and mempolicies, I would rather
> avoid the risk of allocation not seeing the only zone(s) that are
> allowed by its nodemask, and triggering premature OOM.

I would argue, those are a different beast because they are directly
under the control of a not fully privileged user and change between the
empty nodemask and cpusets very often. For this one to trigger we
would have to online/offline the last memory block in the zone very
often and that doesn't resemble a sensible usecase even remotely.

> So maybe the
> updates could be done in a way to avoid that, e.g. first append a copy
> of the old zonelist to the end, then overwrite and terminate with NULL.
> But if this requires any barriers or something similar on the iteration
> site, which is performance critical, then it's bad.
> Maybe a seqcount, that the iteration side only starts checking in the
> slowpath? Like we have with cpusets now.
> I know that Mel noted that stop_machine() also never had such guarantees
> to prevent this, but it could have made the chances smaller.

I think we can come up with some scheme but is this really worth it
considering how unlikely the whole thing is? Well, if somebody hits a
premature OOM killer or allocation failures it would have to be along
with heavy memory hotplug operations and then it would be quite easy
to spot what is going on and try to fix it. I would rather not
overcomplicate it, to be honest.

> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/page_alloc.c | 9 ++-------
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 78bd62418380..217889ecd13f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
> >   */
> >  DEFINE_MUTEX(zonelists_mutex);
> >  
> > -/* return values int ....just for stop_machine() */
> > -static int __build_all_zonelists(void *data)
> > +static void __build_all_zonelists(void *data)
> >  {
> >  	int nid;
> >  	int cpu;
> > @@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
> >  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> >  #endif
> >  	}
> > -
> > -	return 0;
> >  }
> >  
> >  static noinline void __init
> > @@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> >  	if (system_state == SYSTEM_BOOTING) {
> >  		build_all_zonelists_init();
> >  	} else {
> > -		/* we have to stop all cpus to guarantee there is no user
> > -		   of zonelist */
> > -		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> > +		__build_all_zonelists(pgdat);
> >  		/* cpuset refresh routine should be here */
> >  	}
> >  	vm_total_pages = nr_free_pagecache_pages();
> > 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14 11:43     ` Michal Hocko
@ 2017-07-14 11:45       ` Michal Hocko
  2017-07-20  6:16         ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 11:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML

On Fri 14-07-17 13:43:21, Michal Hocko wrote:
> On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
> > On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > build_all_zonelists has been (ab)using stop_machine to make sure that
> > > zonelists do not change while somebody is looking at them. This is
> > > just a gross hack because a) it complicates the context from which
> > > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > > switch locking to a percpu rwsem")) and b) it is not really necessary
> > > especially after "mm, page_alloc: simplify zonelist initialization".
> > > 
> > > Updates of the zonelists happen very seldom, basically only when a zone
> > > becomes populated during memory online or when it loses all the memory
> > > during offline. A racing iteration over zonelists could either miss a
> > > zone or try to work on one zone twice. Both of these are something we
> > > can live with occasionally because there will always be at least one
> > > zone visible so we are not likely to fail allocation too easily for
> > > example.
> > 
> > Given the experience with cpusets and mempolicies, I would rather
> > avoid the risk of allocation not seeing the only zone(s) that are
> > allowed by its nodemask, and triggering premature OOM.
> 
> I would argue, those are a different beast because they are directly
> under the control of a not fully privileged user and change between the
> empty nodemask and cpusets very often. For this one to trigger we
> would have to online/offline the last memory block in the zone very
> often and that doesn't resemble a sensible usecase even remotely.
> 
> > So maybe the
> > updates could be done in a way to avoid that, e.g. first append a copy
> > of the old zonelist to the end, then overwrite and terminate with NULL.
> > But if this requires any barriers or something similar on the iteration
> > site, which is performance critical, then it's bad.
> > Maybe a seqcount, that the iteration side only starts checking in the
> > slowpath? Like we have with cpusets now.
> > I know that Mel noted that stop_machine() also never had such guarantees
> > to prevent this, but it could have made the chances smaller.
> 
> I think we can come up with some scheme but is this really worth it
> considering how unlikely the whole thing is? Well, if somebody hits a
> premature OOM killer or allocation failures it would have to be along
> with heavy memory hotplug operations and then it would be quite easy
> to spot what is going on and try to fix it. I would rather not
> overcomplicate it, to be honest.

And one more thing, Mel has already brought this up in his response.
stop_machine gives very roughly the same strength wrt. a double zone
visit or a missed zone because we do not restart the zonelist iteration.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node
  2017-07-14  8:00 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
@ 2017-07-14 12:14   ` Michal Hocko
  2017-07-20  6:13   ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 12:14 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Kani Toshimitsu

[Fixup email to Toshi Kani - the cover is
http://lkml.kernel.org/r/20170714080006.7250-1-mhocko@kernel.org]

On Fri 14-07-17 10:00:02, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> try_online_node calls hotadd_new_pgdat which already calls
> build_all_zonelists. So the additional call is redundant.  Even though
> hotadd_new_pgdat will only initialize zonelists of the new node this is
> the right thing to do because such a node doesn't have any memory so
> other zonelists would ignore all the zones from this node anyway.
> 
> Cc: Toshi Kani <toshi.kani@hp.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/memory_hotplug.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 639b8af37c45..0d2f6a11075c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1104,13 +1104,6 @@ int try_online_node(int nid)
>  	node_set_online(nid);
>  	ret = register_one_node(nid);
>  	BUG_ON(ret);
> -
> -	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
> -		mutex_lock(&zonelists_mutex);
> -		build_all_zonelists(NULL);
> -		mutex_unlock(&zonelists_mutex);
> -	}
> -
>  out:
>  	mem_hotplug_done();
>  	return ret;
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-14 10:50     ` Michal Hocko
@ 2017-07-14 12:32       ` Mel Gorman
  2017-07-14 12:39         ` Michal Hocko
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 12:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri, Jul 14, 2017 at 12:50:04PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 10:48:10, Mel Gorman wrote:
> > On Fri, Jul 14, 2017 at 10:00:00AM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > __build_all_zonelists reinitializes each online cpu local node for
> > > CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
> > > memoryless nodes could gain some memory during memory hotplug and so the
> > > local node should be changed for CPUs close to such a node. It makes less
> > > sense to do that unconditionally for a newly created NUMA node which is
> > > still offline and without any memory.
> > > 
> > > Let's also simplify the cpu loop and use for_each_online_cpu instead of
> > > an explicit cpu_online check for all possible cpus.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/page_alloc.c | 6 ++----
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 7746824a425d..ebc3311555b1 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
> > >  
> > >  			build_zonelists(pgdat);
> > >  		}
> > > -	}
> > >  
> > >  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> > > -	for_each_possible_cpu(cpu) {
> > >  		/*
> > >  		 * We now know the "local memory node" for each node--
> > >  		 * i.e., the node of the first zone in the generic zonelist.
> > > @@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
> > >  		 * secondary cpus' numa_mem as they come on-line.  During
> > >  		 * node/memory hotplug, we'll fixup all on-line cpus.
> > >  		 */
> > > -		if (cpu_online(cpu))
> > > +		for_each_online_cpu(cpu)
> > >  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> > > -	}
> > >  #endif
> > > +	}
> > >  
> > 
> > This is not as clear a benefit. For each online node, we now go through
> > all online CPUs once per node. There would be some rationale for using
> > for_each_online_cpu.
> 
> I am not sure I understand. I am using for_each_online_cpu...

Yes, but within a loop that looks like

for_each_online_node(nid)
	...
	for_each_online_cpu(cpu)

Or maybe you aren't because we are looking at different baselines. I had
minor fuzz and conflicts applying the series.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-14 12:32       ` Mel Gorman
@ 2017-07-14 12:39         ` Michal Hocko
  2017-07-14 12:56           ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 12:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri 14-07-17 13:32:42, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 12:50:04PM +0200, Michal Hocko wrote:
> > On Fri 14-07-17 10:48:10, Mel Gorman wrote:
> > > On Fri, Jul 14, 2017 at 10:00:00AM +0200, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > __build_all_zonelists reinitializes each online cpu local node for
> > > > CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
> > > > memoryless nodes could gain some memory during memory hotplug and so the
> > > > local node should be changed for CPUs close to such a node. It makes less
> > > > sense to do that unconditionally for a newly created NUMA node which is
> > > > still offline and without any memory.
> > > > 
> > > > Let's also simplify the cpu loop and use for_each_online_cpu instead of
> > > > an explicit cpu_online check for all possible cpus.
> > > > 
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  mm/page_alloc.c | 6 ++----
> > > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 7746824a425d..ebc3311555b1 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
> > > >  
> > > >  			build_zonelists(pgdat);
> > > >  		}
> > > > -	}
> > > >  
> > > >  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> > > > -	for_each_possible_cpu(cpu) {
> > > >  		/*
> > > >  		 * We now know the "local memory node" for each node--
> > > >  		 * i.e., the node of the first zone in the generic zonelist.
> > > > @@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
> > > >  		 * secondary cpus' numa_mem as they come on-line.  During
> > > >  		 * node/memory hotplug, we'll fixup all on-line cpus.
> > > >  		 */
> > > > -		if (cpu_online(cpu))
> > > > +		for_each_online_cpu(cpu)
> > > >  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> > > > -	}
> > > >  #endif
> > > > +	}
> > > >  
> > > 
> > > This is not as clear a benefit. For each online node, we now go through
> > > all online CPUs once per node. There would be some rationale for using
> > > for_each_online_cpu.
> > 
> > I am not sure I understand. I am using for_each_online_cpu...
> 
> Yes, but within a loop that looks like
> 
> for_each_online_node(nid)
> 	...
> 	for_each_online_cpu(cpu)
> 
> Or maybe you aren't because we are looking at different baselines. I had
> minor fuzz and conflicts applying the series.

The current mmotm after this patch looks like this
	if (self && !node_online(self->node_id)) {
		build_zonelists(self);
	} else {
		for_each_online_node(nid) {
			pg_data_t *pgdat = NODE_DATA(nid);

			build_zonelists(pgdat);
		}

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
		/*
		 * We now know the "local memory node" for each node--
		 * i.e., the node of the first zone in the generic zonelist.
		 * Set up numa_mem percpu variable for on-line cpus.  During
		 * boot, only the boot cpu should be on-line;  we'll init the
		 * secondary cpus' numa_mem as they come on-line.  During
		 * node/memory hotplug, we'll fixup all on-line cpus.
		 */
		for_each_online_cpu(cpu)
			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
#endif
	}

So for_each_online_cpu is called outside of the for_each_online_node.
Have a look at
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
branch attempts/zonlists-build-simplification
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14  8:00 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
  2017-07-14  9:55   ` Mel Gorman
@ 2017-07-14 12:46   ` Mel Gorman
  2017-07-14 13:02     ` Michal Hocko
  2017-07-20  6:55   ` Vlastimil Babka
  2 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 12:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	Michal Hocko

On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate the initialized part of
> the zonelist by a NULL zone. While this is functionally correct it is
> quite suboptimal because we cannot allow updaters to race with zonelist
> users who could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelist consumers still might see a halfway initialized state but that
> should be much more tolerable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely preparatory work for later changes.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 42 ++++++++++++++++++------------------------
>  1 file changed, 18 insertions(+), 24 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 00e117922b3f..78bd62418380 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
>   * This results in maximum locality--normal zone overflows into local
>   * DMA zone, if any--but risks exhausting DMA zone.
>   */
> -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
>  {
> -	int j;
>  	struct zonelist *zonelist;
> +	int i, zoneref_idx = 0;
>  
>  	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
> -	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
> -		;
> -	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
> -	zonelist->_zonerefs[j].zone = NULL;
> -	zonelist->_zonerefs[j].zone_idx = 0;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		pg_data_t *node = NODE_DATA(node_order[i]);
> +
> +		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> +	}

The naming here is weird to say the least and makes this a lot more
confusing than it needs to be. Primarily, it's because the zoneref_idx
parameter gets renamed to nr_zones in build_zonelists_node where it has
nothing to do with the number of zones at all.

It also iterates for longer than it needs to. MAX_NUMNODES can be large,
with most of the nodes empty, but the loop happily goes through them
anyway. Pass zoneref_idx in as a pointer that is updated by the function
and use the return value to break the loop when an empty node is
encountered?
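
Something along these lines, say (a hypothetical signature change for
build_zonelists_node, untested):

	/*
	 * build_zonelists_node() would advance *zoneref_idx through the
	 * pointer and return false once it meets a node with no populated
	 * zones, letting the caller stop early instead of scanning all of
	 * MAX_NUMNODES.
	 */
	for (i = 0; i < MAX_NUMNODES; i++) {
		if (!build_zonelists_node(NODE_DATA(node_order[i]), zonelist,
					  &zoneref_idx))
			break;
	}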

> +	zonelist->_zonerefs[zoneref_idx].zone = NULL;
> +	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
>  }
>  

It *might* be safer given the next patch to zero out the remainder of
the _zonerefs so that there is no combination of node add/remove that has
an iterator working with a semi-valid _zoneref which is beyond the last
correct value. It *should* be safe as the very last entry will always
be null but if you don't zero it out, it is possible for iterators to be
working beyond the "end" of the zonelist for a short window.
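
Concretely, something like this on top of the new code (a sketch, assuming
zoneref_idx is the index the build loop ends at):

	/*
	 * Clear everything from the new terminator to the end so that an
	 * iterator already past the old end keeps seeing NULL zones
	 * instead of stale entries from the previous zonelist.
	 */
	for (; zoneref_idx < ARRAY_SIZE(zonelist->_zonerefs); zoneref_idx++) {
		zonelist->_zonerefs[zoneref_idx].zone = NULL;
		zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
	}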

Otherwise think it's ok including my stupid comment about node_order
stack usage.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14 11:00     ` Michal Hocko
@ 2017-07-14 12:47       ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 12:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri, Jul 14, 2017 at 01:00:25PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 10:59:32, Mel Gorman wrote:
> > On Fri, Jul 14, 2017 at 10:00:04AM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > build_all_zonelists has been (ab)using stop_machine to make sure that
> > > zonelists do not change while somebody is looking at them. This is
> > > just a gross hack because a) it complicates the context from which
> > > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > > switch locking to a percpu rwsem")) and b) it is not really necessary
> > > especially after "mm, page_alloc: simplify zonelist initialization".
> > > 
> > > Updates of the zonelists happen very seldom, basically only when a zone
> > > becomes populated during memory online or when it loses all the memory
> > > during offline. A racing iteration over zonelists could either miss a
> > > zone or try to work on one zone twice. Both of these are something we
> > > can live with occasionally because there will always be at least one
> > > zone visible so we are not likely to fail allocation too easily for
> > > example.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > This patch is contingent on the last patch which updates in place
> > instead of zeroing the early part of the zonelist first but needs to fix
> > the stack usage issues. I think it's also worth pointing out in the
> > changelog that stop_machine never gave the guarantees it claimed as a
> > process iterating through the zonelist can be stopped so when it resumes
> > the zonelist has changed underneath it. Doing it online is roughly
> > equivalent in terms of safety.
> 
> OK, what about the following addendum?
> "
> Please note that the original stop_machine approach doesn't really
> provide a better exclusion because the iteration might be interrupted
> half way (unless the whole iteration is preempt disabled which is not the
> case in most cases) so some zones could still be seen twice or a
> zone missed.
> "

Works for me.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 11:38         ` Michal Hocko
@ 2017-07-14 12:56           ` Mel Gorman
  2017-07-14 13:01             ` Mel Gorman
  2017-07-14 13:08             ` Michal Hocko
  0 siblings, 2 replies; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 12:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri, Jul 14, 2017 at 01:38:40PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 12:16:33, Mel Gorman wrote:
> > On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > > > That should be "default" because the original code would have the proc
> > > > entry display "default" unless it was set at runtime. Pretty weird I
> > > > know but it's always possible someone is parsing the original default
> > > > and not handling it properly.
> > > 
> > > Ohh, right! That is indeed strange. Then I guess it would probably be
> > > better to simply return Node to make it clear what the default is. What
> > > do you think?
> > > 
> > 
> > That would work too. The casing still matches.
> 
> This folded in?
> ---
> From c7c36f011590680b254813be00ed791ddbc1bf1c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 14 Jul 2017 13:36:05 +0200
> Subject: [PATCH] fold me "mm, page_alloc: rip out ZONELIST_ORDER_ZONE"
> 
> - do not print Default in the sysctl handler because our behavior was
> rather inconsistent in the past: numa_zonelist_order was lowercase while
> zonelist_order_name was uppercase, so a boot-time unchanged value would
> print lowercase while an updated value could be uppercase. Print "Node",
> which is the default, instead - Mel
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd4c96edcec3..49bade7ff049 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4828,8 +4828,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  	int ret;
>  
>  	if (!write) {
> -		int len = sizeof("Default");
> -		if (copy_to_user(buffer, "Default", len))
> +		int len = sizeof("Node");
> +		if (copy_to_user(buffer, "Node", len))
>  			return -EFAULT;

Ok for the name. But what's with using sizeof? The type is char * so it
just happens to work for Default, but not for Node. Also strongly suggest
you continue using proc_dostring because it catches all the corner-cases
that can occur.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-14 12:39         ` Michal Hocko
@ 2017-07-14 12:56           ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 12:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri, Jul 14, 2017 at 02:39:41PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 13:32:42, Mel Gorman wrote:
> > On Fri, Jul 14, 2017 at 12:50:04PM +0200, Michal Hocko wrote:
> > > On Fri 14-07-17 10:48:10, Mel Gorman wrote:
> > > > On Fri, Jul 14, 2017 at 10:00:00AM +0200, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > __build_all_zonelists reinitializes each online cpu local node for
> > > > > CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
> > > > > memoryless nodes could gain some memory during memory hotplug and so the
> > > > > local node should be changed for CPUs close to such a node. It makes less
> > > > > sense to do that unconditionally for a newly created NUMA node which is
> > > > > still offline and without any memory.
> > > > > 
> > > > > Let's also simplify the cpu loop and use for_each_online_cpu instead of
> > > > > an explicit cpu_online check for all possible cpus.
> > > > > 
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > > ---
> > > > >  mm/page_alloc.c | 6 ++----
> > > > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 7746824a425d..ebc3311555b1 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
> > > > >  
> > > > >  			build_zonelists(pgdat);
> > > > >  		}
> > > > > -	}
> > > > >  
> > > > >  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> > > > > -	for_each_possible_cpu(cpu) {
> > > > >  		/*
> > > > >  		 * We now know the "local memory node" for each node--
> > > > >  		 * i.e., the node of the first zone in the generic zonelist.
> > > > > @@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
> > > > >  		 * secondary cpus' numa_mem as they come on-line.  During
> > > > >  		 * node/memory hotplug, we'll fixup all on-line cpus.
> > > > >  		 */
> > > > > -		if (cpu_online(cpu))
> > > > > +		for_each_online_cpu(cpu)
> > > > >  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> > > > > -	}
> > > > >  #endif
> > > > > +	}
> > > > >  
> > > > 
> > > > This is not as clear a benefit. For each online node, we now go through
> > > > all online CPUs once per node. There would be some rationale for using
> > > > for_each_online_cpu.
> > > 
> > > I am not sure I understand. I am using for_each_online_cpu...
> > 
> > Yes, but within a loop that looks like
> > 
> > for_each_online_node(nid)
> > 	...
> > 	for_each_online_cpu(cpu)
> > 
> > Or maybe you aren't because we are looking at different baselines. I had
> > minor fuzz and conflicts applying the series.
> 
> The current mmotm after this patch looks like this
> 	if (self && !node_online(self->node_id)) {
> 		build_zonelists(self);
> 	} else {
> 		for_each_online_node(nid) {
> 			pg_data_t *pgdat = NODE_DATA(nid);
> 
> 			build_zonelists(pgdat);
> 		}
> 
> #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> 		/*
> 		 * We now know the "local memory node" for each node--
> 		 * i.e., the node of the first zone in the generic zonelist.
> 		 * Set up numa_mem percpu variable for on-line cpus.  During
> 		 * boot, only the boot cpu should be on-line;  we'll init the
> 		 * secondary cpus' numa_mem as they come on-line.  During
> 		 * node/memory hotplug, we'll fixup all on-line cpus.
> 		 */
> 		for_each_online_cpu(cpu)
> 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> #endif
> 	}
> 

Ok, that makes more sense.

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 12:56           ` Mel Gorman
@ 2017-07-14 13:01             ` Mel Gorman
  2017-07-14 13:08             ` Michal Hocko
  1 sibling, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 13:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri, Jul 14, 2017 at 01:56:16PM +0100, Mel Gorman wrote:
> >  	if (!write) {
> > -		int len = sizeof("Default");
> > -		if (copy_to_user(buffer, "Default", len))
> > +		int len = sizeof("Node");
> > +		if (copy_to_user(buffer, "Node", len))
> >  			return -EFAULT;
> 
> Ok for the name. But what's with using sizeof?

Bah, sizeof on a static compile-time string versus a char *. Never mind.
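
For completeness, the distinction in a standalone snippet (plain C,
nothing kernel-specific):

	#include <stdio.h>

	int main(void)
	{
		const char *p = "Node";

		/* a string literal is a char array: sizeof counts the NUL too */
		printf("%zu\n", sizeof("Node"));	/* 5 */
		/* a char * is just a pointer: sizeof is the pointer size */
		printf("%zu\n", sizeof(p));		/* typically 8 on 64-bit */
		return 0;
	}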

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14 12:46   ` Mel Gorman
@ 2017-07-14 13:02     ` Michal Hocko
  2017-07-14 14:18       ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 13:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri 14-07-17 13:46:46, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > build_zonelists gradually builds zonelists from the nearest to the most
> > distant node. As we do not know how many populated zones we will have in
> > each node we rely on the _zoneref to terminate the initialized part of
> > the zonelist by a NULL zone. While this is functionally correct it is
> > quite suboptimal because we cannot allow updaters to race with zonelist
> > users who could see an empty zonelist and fail the allocation
> > or hit the OOM killer in the worst case.
> > 
> > We can do much better, though. We can store the node ordering into an
> > already existing node_order array and then give this array to
> > build_zonelists_in_node_order and do the whole initialization at once.
> > zonelist consumers still might see a halfway initialized state but that
> > should be much more tolerable because the list will not be empty and
> > they would either see some zone twice or skip over some zone(s) in the
> > worst case which shouldn't lead to immediate failures.
> > 
> > This patch alone doesn't introduce any functional change yet, though, it
> > is merely preparatory work for later changes.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/page_alloc.c | 42 ++++++++++++++++++------------------------
> >  1 file changed, 18 insertions(+), 24 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 00e117922b3f..78bd62418380 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
> >   * This results in maximum locality--normal zone overflows into local
> >   * DMA zone, if any--but risks exhausting DMA zone.
> >   */
> > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
> > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
> >  {
> > -	int j;
> >  	struct zonelist *zonelist;
> > +	int i, zoneref_idx = 0;
> >  
> >  	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
> > -	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
> > -		;
> > -	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
> > -	zonelist->_zonerefs[j].zone = NULL;
> > -	zonelist->_zonerefs[j].zone_idx = 0;
> > +
> > +	for (i = 0; i < MAX_NUMNODES; i++) {
> > +		pg_data_t *node = NODE_DATA(node_order[i]);
> > +
> > +		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > +	}
> 
> The naming here is weird to say the least and makes this a lot more
> confusing than it needs to be. Primarily, it's because the zoneref_idx
> parameter gets renamed to nr_zones in build_zonelists_node where it has
> nothing to do with the number of zones at all.

you are right. I just wanted to get rid of `j' and didn't realize
nr_zones would fit much better.

> It also iterates for longer than it needs to. MAX_NUMNODES can be large,
> with most of the nodes empty, but the loop happily goes through them
> anyway. Pass zoneref_idx in as a pointer that is updated by the function
> and use the return value to break the loop when an empty node is
> encountered?
> 
> > +	zonelist->_zonerefs[zoneref_idx].zone = NULL;
> > +	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
> >  }
> >  
> 
> It *might* be safer given the next patch to zero out the remainder of
> the _zonerefs so that there is no combination of node add/remove that has
> an iterator working with a semi-valid _zoneref which is beyond the last
> correct value. It *should* be safe as the very last entry will always
> be null but if you don't zero it out, it is possible for iterators to be
> working beyond the "end" of the zonelist for a short window.

yes, that is true, but there will always be a terminating NULL zone and I
found that acceptable. It is basically the same thing as accessing an
empty zone or a zone twice. Or do you think this is absolutely necessary
to handle?

> Otherwise think it's ok including my stupid comment about node_order
> stack usage.

What do you think about this on top?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 49bade7ff049..3b98524c04ec 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+		unsigned nr_nodes)
 {
 	struct zonelist *zonelist;
-	int i, zoneref_idx = 0;
+	int i, nr_zones = 0;
 
 	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
 
-	for (i = 0; i < MAX_NUMNODES; i++) {
+	for (i = 0; i < nr_nodes; i++) {
 		pg_data_t *node = NODE_DATA(node_order[i]);
 
-		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
+		nr_zones = build_zonelists_node(node, zonelist, nr_zones);
 	}
-	zonelist->_zonerefs[zoneref_idx].zone = NULL;
-	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
+	zonelist->_zonerefs[nr_zones].zone = NULL;
+	zonelist->_zonerefs[nr_zones].zone_idx = 0;
 }
 
 /*
@@ -4935,12 +4936,12 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
 	struct zonelist *zonelist;
-	int zoneref_idx = 0;
+	int nr_zones = 0;
 
 	zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK];
-	zoneref_idx = build_zonelists_node(pgdat, zonelist, zoneref_idx);
-	zonelist->_zonerefs[zoneref_idx].zone = NULL;
-	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
+	nr_zones = build_zonelists_node(pgdat, zonelist, nr_zones);
+	zonelist->_zonerefs[nr_zones].zone = NULL;
+	zonelist->_zonerefs[nr_zones].zone_idx = 0;
 }
 
 /*
@@ -4979,7 +4980,7 @@ static void build_zonelists(pg_data_t *pgdat)
 		load--;
 	}
 
-	build_zonelists_in_node_order(pgdat, node_order);
+	build_zonelists_in_node_order(pgdat, node_order, i);
 	build_thisnode_zonelists(pgdat);
 }
 
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 12:56           ` Mel Gorman
  2017-07-14 13:01             ` Mel Gorman
@ 2017-07-14 13:08             ` Michal Hocko
  1 sibling, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-14 13:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 13:56:16, Mel Gorman wrote:
> Also strongly suggest you continue using proc_dostring because it
> catches all the corner-cases that can occur.

proc_dostring would need a data field in the sysctl table and I found
that more confusing than having a simplistic read/write handler. I can
do that if you
insist though.
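
For completeness, the proc_dostring route would mean keeping something
like this around (an illustrative sketch of the table entry only):

	static char numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN] = "Node";

	/* proc_dostring needs the backing buffer wired up via .data */
	{
		.procname	= "numa_zonelist_order",
		.data		= &numa_zonelist_order,
		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
		.mode		= 0644,
		.proc_handler	= numa_zonelist_order_handler,
	},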
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14 13:02     ` Michal Hocko
@ 2017-07-14 14:18       ` Mel Gorman
  2017-07-17  6:06         ` Michal Hocko
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-14 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote:
> > It *might* be safer given the next patch to zero out the remainder of
> > the _zonerefs so that there is no combination of node add/remove that has
> > an iterator working with a semi-valid _zoneref which is beyond the last
> > correct value. It *should* be safe as the very last entry will always
> > be null but if you don't zero it out, it is possible for iterators to be
> > working beyond the "end" of the zonelist for a short window.
> 
> yes that is true but there will always be a terminating NULL zone and I
> found that acceptable. It is basically the same thing as accessing an
> empty zone or a zone twice. Or do you think this is absolutely necessary
> to handle?
> 

I don't think it's absolutely necessary. While you could construct some
odd behaviour for iterators currently past the end of the list, they would
eventually encounter a NULL.
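
Schematically, consumers walk the list like this (simplified, not the
exact iterator helpers), so even a racing iterator terminates:

	struct zoneref *z;

	/*
	 * Stale entries past the logical end are still valid zonerefs;
	 * the walk only ends at the entry that is always kept NULL.
	 */
	for (z = zonelist->_zonerefs; z->zone != NULL; z++)
		; /* consume z->zone */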

> > Otherwise I think it's ok, including my stupid comment about node_order
> > stack usage.
> 
> What do you think about this on top?
> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 49bade7ff049..3b98524c04ec 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
>   * This results in maximum locality--normal zone overflows into local
>   * DMA zone, if any--but risks exhausting DMA zone.
>   */
> -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
> +		unsigned nr_nodes)
>  {
>  	struct zonelist *zonelist;
> -	int i, zoneref_idx = 0;
> +	int i, nr_zones = 0;
>  
>  	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
>  
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> +	for (i = 0; i < nr_nodes; i++) {

The first iteration is then -- for (i = 0; i < 0; i++)

Fairly sure that's not what you meant.


>  		pg_data_t *node = NODE_DATA(node_order[i]);
>  
> -		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> +		nr_zones = build_zonelists_node(node, zonelist, nr_zones);

I meant converting build_zonelists_node and passing in &nr_zones and
returning false when an empty node is encountered. In this context,
it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
build_zonelists_node as well.
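
Something like this untested sketch of the shape I mean:

	/* return false for an empty node so the caller can stop early */
	static bool build_zonelists_node(pg_data_t *pgdat,
					 struct zonelist *zonelist,
					 int *nr_zonerefs)
	{
		enum zone_type zone_type = MAX_NR_ZONES;
		bool populated = false;
		struct zone *zone;

		do {
			zone_type--;
			zone = pgdat->node_zones + zone_type;
			if (managed_zone(zone)) {
				zoneref_set_zone(zone,
					&zonelist->_zonerefs[(*nr_zonerefs)++]);
				check_highest_zone(zone_type);
				populated = true;
			}
		} while (zone_type);

		return populated;
	}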

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14 14:18       ` Mel Gorman
@ 2017-07-17  6:06         ` Michal Hocko
  2017-07-17  8:07           ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-17  6:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 03:02:42PM +0200, Michal Hocko wrote:
[...]
> > What do you think about this on top?
> > ---
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 49bade7ff049..3b98524c04ec 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4913,20 +4913,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
> >   * This results in maximum locality--normal zone overflows into local
> >   * DMA zone, if any--but risks exhausting DMA zone.
> >   */
> > -static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
> > +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
> > +		unsigned nr_nodes)
> >  {
> >  	struct zonelist *zonelist;
> > -	int i, zoneref_idx = 0;
> > +	int i, nr_zones = 0;
> >  
> >  	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
> >  
> > -	for (i = 0; i < MAX_NUMNODES; i++) {
> > +	for (i = 0; i < nr_nodes; i++) {
> 
> The first iteration is then -- for (i = 0; i < 0; i++)

find_next_best_node always returns at least one node (the local one) so
I believe that nr_nodes should never be 0.

> Fairly sure that's not what you meant.
> 
> 
> >  		pg_data_t *node = NODE_DATA(node_order[i]);
> >  
> > -		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > +		nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> 
> I meant converting build_zonelists_node and passing in &nr_zones and
> returning false when an empty node is encountered. In this context,
> it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> build_zonelists_node as well.

hmm, why don't we rather make it zonerefs based then. Something
like the following?
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b98524c04ec..01e67e629351 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
-				int nr_zones)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 {
 	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
+	int nr_zones = 0;
 
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (managed_zone(zone)) {
-			zoneref_set_zone(zone,
-				&zonelist->_zonerefs[nr_zones++]);
+			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
@@ -4916,18 +4915,21 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
 		unsigned nr_nodes)
 {
-	struct zonelist *zonelist;
-	int i, nr_zones = 0;
+	struct zoneref *zonerefs;
+	int i;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
 
 	for (i = 0; i < nr_nodes; i++) {
+		int nr_zones;
+
 		pg_data_t *node = NODE_DATA(node_order[i]);
 
-		nr_zones = build_zonelists_node(node, zonelist, nr_zones);
+		nr_zones = build_zonerefs_node(node, zonerefs);
+		zonerefs += nr_zones;
 	}
-	zonelist->_zonerefs[nr_zones].zone = NULL;
-	zonelist->_zonerefs[nr_zones].zone_idx = 0;
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4935,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-	struct zonelist *zonelist;
-	int nr_zones = 0;
+	struct zoneref *zonerefs;
+	int nr_zones;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK];
-	nr_zones = build_zonelists_node(pgdat, zonelist, nr_zones);
-	zonelist->_zonerefs[nr_zones].zone = NULL;
-	zonelist->_zonerefs[nr_zones].zone_idx = 0;
+	zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs);
+	zonerefs += nr_zones;
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
@@ -5009,13 +5012,13 @@ static void setup_min_slab_ratio(void);
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	enum zone_type j;
-	struct zonelist *zonelist;
+	struct zoneref *zonerefs;
 
 	local_node = pgdat->node_id;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	j = build_zonelists_node(pgdat, zonelist, 0);
+	zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs);
+	zonerefs += nr_zones;
 
 	/*
 	 * Now we build the zonelist so that it contains the zones
@@ -5028,16 +5031,18 @@ static void build_zonelists(pg_data_t *pgdat)
 	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
 		if (!node_online(node))
 			continue;
-		j = build_zonelists_node(NODE_DATA(node), zonelist, j);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		zonerefs += nr_zones;
 	}
 	for (node = 0; node < local_node; node++) {
 		if (!node_online(node))
 			continue;
-		j = build_zonelists_node(NODE_DATA(node), zonelist, j);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		zonerefs += nr_zones;
 	}
 
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 #endif	/* CONFIG_NUMA */

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-17  6:06         ` Michal Hocko
@ 2017-07-17  8:07           ` Mel Gorman
  2017-07-17  8:19             ` Michal Hocko
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-17  8:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > Fairly sure that's not what you meant.
> > 
> > 
> > >  		pg_data_t *node = NODE_DATA(node_order[i]);
> > >  
> > > -		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > > +		nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> > 
> > I meant converting build_zonelists_node and passing in &nr_zones and
> > returning false when an empty node is encountered. In this context,
> > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > build_zonelists_node as well.
> 
> hmm, why don't we rather make it zonerefs based then. Something
> like the following?

Works for me.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-17  8:07           ` Mel Gorman
@ 2017-07-17  8:19             ` Michal Hocko
  2017-07-17  8:58               ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-17  8:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > Fairly sure that's not what you meant.
> > > 
> > > 
> > > >  		pg_data_t *node = NODE_DATA(node_order[i]);
> > > >  
> > > > -		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > > > +		nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> > > 
> > > I meant converting build_zonelists_node and passing in &nr_zones and
> > > returning false when an empty node is encountered. In this context,
> > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > > build_zonelists_node as well.
> > 
> > hmm, why don't we rather make it zonerefs based then. Something
> > like the following?
> 
> Works for me.

Should I fold it to the patch or make it a patch on its own?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-17  8:19             ` Michal Hocko
@ 2017-07-17  8:58               ` Mel Gorman
  2017-07-17  9:15                 ` Michal Hocko
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2017-07-17  8:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote:
> On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > > Fairly sure that's not what you meant.
> > > > 
> > > > 
> > > > >  		pg_data_t *node = NODE_DATA(node_order[i]);
> > > > >  
> > > > > -		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > > > > +		nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> > > > 
> > > > I meant converting build_zonelists_node and passing in &nr_zones and
> > > > returning false when an empty node is encountered. In this context,
> > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > > > build_zonelists_node as well.
> > > 
> > > hmm, why don't we rather make it zonerefs based then. Something
> > > like the following?
> > 
> > Works for me.
> 
> Should I fold it to the patch or make it a patch on its own?

I have no strong feelings either way but if it was folded then the
overall naming should be easier to follow (at least for me).

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-17  8:58               ` Mel Gorman
@ 2017-07-17  9:15                 ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-17  9:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML

On Mon 17-07-17 09:58:04, Mel Gorman wrote:
> On Mon, Jul 17, 2017 at 10:19:42AM +0200, Michal Hocko wrote:
> > On Mon 17-07-17 09:07:23, Mel Gorman wrote:
> > > On Mon, Jul 17, 2017 at 08:06:40AM +0200, Michal Hocko wrote:
> > > > On Fri 14-07-17 15:18:23, Mel Gorman wrote:
> > > > > Fairly sure that's not what you meant.
> > > > > 
> > > > > 
> > > > > >  		pg_data_t *node = NODE_DATA(node_order[i]);
> > > > > >  
> > > > > > -		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> > > > > > +		nr_zones = build_zonelists_node(node, zonelist, nr_zones);
> > > > > 
> > > > > I meant converting build_zonelists_node and passing in &nr_zones and
> > > > > returning false when an empty node is encountered. In this context,
> > > > > it's also not about zones, it really is nr_zonerefs. Rename nr_zones in
> > > > > build_zonelists_node as well.
> > > > 
> > > > hmm, why don't we rather make it zonerefs based then. Something
> > > > like the following?
> > > 
> > > Works for me.
> > 
> > Should I fold it to the patch or make it a patch on its own?
> 
> I have no strong feelings either way but if it was folded then the
> overall naming should be easier to follow (at least for me).

OK, I will fold it in then. Unless there are more issues/feedback to
address I will repost the full series in a few days.

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
  2017-07-14  9:36   ` Mel Gorman
@ 2017-07-19  9:33   ` Vlastimil Babka
  2017-07-19 13:44     ` Michal Hocko
  1 sibling, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-19  9:33 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko,
	linux-api

On 07/14/2017 09:59 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Supporting zone ordered zonelists costs us just a lot of code while
> the usefulness is arguable if existent at all. Mel has already made
> node ordering default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fallback to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save the reclaim time much. So the only
> advantage of the zone ordering is under a light memory pressure when
> highmem requests do not ever hit into lowmem zones and the lowmem
> pressure doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use a 64b kernel on such HW I believe we
> should rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> somebody tries to set zone ordering either from kernel command line
> or the sysctl.
> 
> Cc: <linux-api@vger.kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Found some leftovers to clean up:

include/linux/mmzone.h:
extern char numa_zonelist_order[];
#define NUMA_ZONELIST_ORDER_LEN 16      /* string buffer size */

Also update docs?
Documentation/sysctl/vm.txt:zone.  Specify "[Zz]one" for zone order.
Documentation/admin-guide/kernel-parameters.txt:
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
Documentation/vm/numa:a default zonelist order based on the sizes of the
various zone types relative
Documentation/vm/numa:default zonelist order may be overridden using the
numa_zonelist_order kernel

Otherwise,
Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug
  2017-07-14  7:59 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
  2017-07-14  9:39   ` Mel Gorman
@ 2017-07-19 13:15   ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-19 13:15 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko

On 07/14/2017 09:59 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> boot_pageset is a boot time hack which gets superseded by normal
> pagesets later in the boot process. It makes zero sense to reinitialize
> it again and again during memory hotplug.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 38 +++++++++++++++++++++-----------------
>  1 file changed, 21 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d9f4ea057e74..7746824a425d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5098,23 +5098,8 @@ static int __build_all_zonelists(void *data)
>  		}
>  	}
>  
> -	/*
> -	 * Initialize the boot_pagesets that are going to be used
> -	 * for bootstrapping processors. The real pagesets for
> -	 * each zone will be allocated later when the per cpu
> -	 * allocator is available.
> -	 *
> -	 * boot_pagesets are used also for bootstrapping offline
> -	 * cpus if the system is already booted because the pagesets
> -	 * are needed to initialize allocators on a specific cpu too.
> -	 * F.e. the percpu allocator needs the page allocator which
> -	 * needs the percpu allocator in order to allocate its pagesets
> -	 * (a chicken-egg dilemma).
> -	 */
> -	for_each_possible_cpu(cpu) {
> -		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> -
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +	for_each_possible_cpu(cpu) {
>  		/*
>  		 * We now know the "local memory node" for each node--
>  		 * i.e., the node of the first zone in the generic zonelist.
> @@ -5125,8 +5110,8 @@ static int __build_all_zonelists(void *data)
>  		 */
>  		if (cpu_online(cpu))
>  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> -#endif
>  	}
> +#endif
>  
>  	return 0;
>  }
> @@ -5134,7 +5119,26 @@ static int __build_all_zonelists(void *data)
>  static noinline void __init
>  build_all_zonelists_init(void)
>  {
> +	int cpu;
> +
>  	__build_all_zonelists(NULL);
> +
> +	/*
> +	 * Initialize the boot_pagesets that are going to be used
> +	 * for bootstrapping processors. The real pagesets for
> +	 * each zone will be allocated later when the per cpu
> +	 * allocator is available.
> +	 *
> +	 * boot_pagesets are used also for bootstrapping offline
> +	 * cpus if the system is already booted because the pagesets
> +	 * are needed to initialize allocators on a specific cpu too.
> +	 * F.e. the percpu allocator needs the page allocator which
> +	 * needs the percpu allocator in order to allocate its pagesets
> +	 * (a chicken-egg dilemma).
> +	 */
> +	for_each_possible_cpu(cpu)
> +		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
> +
>  	mminit_verify_zonelist();
>  	cpuset_init_current_mems_allowed();
>  }
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-14  8:00 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
  2017-07-14  9:48   ` Mel Gorman
@ 2017-07-19 13:19   ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-19 13:19 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko

On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __build_all_zonelists reinitializes each online cpu's local node for
> CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
> memoryless nodes could gain some memory during memory hotplug and so the
> local node should be changed for CPUs close to such a node. It makes less
> sense to do that unconditionally for a newly created NUMA node which is
> still offline and without any memory.
> 
> Let's also simplify the cpu loop and use for_each_online_cpu instead of
> an explicit cpu_online check for all possible cpus.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 6 ++----
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7746824a425d..ebc3311555b1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
>  
>  			build_zonelists(pgdat);
>  		}
> -	}
>  
>  #ifdef CONFIG_HAVE_MEMORYLESS_NODES
> -	for_each_possible_cpu(cpu) {
>  		/*
>  		 * We now know the "local memory node" for each node--
>  		 * i.e., the node of the first zone in the generic zonelist.
> @@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
>  		 * secondary cpus' numa_mem as they come on-line.  During
>  		 * node/memory hotplug, we'll fixup all on-line cpus.
>  		 */
> -		if (cpu_online(cpu))
> +		for_each_online_cpu(cpu)
>  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
> -	}
>  #endif
> +	}
>  
>  	return 0;
>  }
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists
  2017-07-14  8:00 ` [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists Michal Hocko
@ 2017-07-19 13:33   ` Vlastimil Babka
  2017-07-20  8:15     ` Michal Hocko
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-19 13:33 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko,
	Wen Congyang

On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_all_zonelists gets a zone parameter to initialize zone's
> pagesets. There is only a single user which gives a non-NULL
> zone parameter and that one doesn't really need the rest of the
> build_all_zonelists (see 6dcd73d7011b ("memory-hotplug: allocate zone's
> pcp before onlining pages")).
> 
> Therefore remove setup_zone_pageset from build_all_zonelists and call it
> from its only user directly. This will also remove pointless zonelists
> rebuilding which is always good.
> 
> Cc: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

with a small point below

...

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ebc3311555b1..00e117922b3f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5065,7 +5065,6 @@ static void build_zonelists(pg_data_t *pgdat)
>  static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
>  static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
>  static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
> -static void setup_zone_pageset(struct zone *zone);
>  
>  /*
>   * Global mutex to protect against size modification of zonelists
> @@ -5146,19 +5145,14 @@ build_all_zonelists_init(void)
>   * unless system_state == SYSTEM_BOOTING.
>   *
>   * __ref due to (1) call of __meminit annotated setup_zone_pageset

Isn't the whole (1) in the comment invalid now?

> - * [we're only called with non-NULL zone through __meminit paths] and
> - * (2) call of __init annotated helper build_all_zonelists_init
> + * and (2) call of __init annotated helper build_all_zonelists_init
>   * [protected by SYSTEM_BOOTING].
>   */
> -void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
> +void __ref build_all_zonelists(pg_data_t *pgdat)
>  {
>  	if (system_state == SYSTEM_BOOTING) {
>  		build_all_zonelists_init();
>  	} else {
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -		if (zone)
> -			setup_zone_pageset(zone);
> -#endif
>  		/* we have to stop all cpus to guarantee there is no user
>  		   of zonelist */
>  		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> @@ -5432,7 +5426,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu)
>  	pageset_set_high_and_batch(zone, pcp);
>  }
>  
> -static void __meminit setup_zone_pageset(struct zone *zone)
> +void __meminit setup_zone_pageset(struct zone *zone)
>  {
>  	int cpu;
>  	zone->pageset = alloc_percpu(struct per_cpu_pageset);
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-19  9:33   ` Vlastimil Babka
@ 2017-07-19 13:44     ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-19 13:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML, linux-api

On Wed 19-07-17 11:33:49, Vlastimil Babka wrote:
> On 07/14/2017 09:59 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Supporting zone ordered zonelists costs us just a lot of code while
> > the usefulness is arguable if existent at all. Mel has already made
> > node ordering default on 64b systems. 32b systems are still using
> > ZONELIST_ORDER_ZONE because it is considered better to fallback to
> > a different NUMA node rather than consume precious lowmem zones.
> > 
> > This argument is, however, weakened by the fact that the memory reclaim
> > has been reworked to be node rather than zone oriented. This means
> > that lowmem requests have to skip over all highmem pages on LRUs already
> > and so zone ordering doesn't save the reclaim time much. So the only
> > advantage of the zone ordering is under a light memory pressure when
> > highmem requests do not ever hit into lowmem zones and the lowmem
> > pressure doesn't need to reclaim.
> > 
> > Considering that 32b NUMA systems are rather suboptimal already and
> > it is generally advisable to use a 64b kernel on such HW I believe we
> > should rather care about the code maintainability and just get rid of
> > ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> > somebody tries to set zone ordering either from kernel command line
> > or the sysctl.
> > 
> > Cc: <linux-api@vger.kernel.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Found some leftovers to cleanup:
> 
> include/linux/mmzone.h:
> extern char numa_zonelist_order[];
> #define NUMA_ZONELIST_ORDER_LEN 16      /* string buffer size */
> 
> Also update docs?
> Documentation/sysctl/vm.txt:zone.  Specify "[Zz]one" for zone order.
> Documentation/admin-guide/kernel-parameters.txt:
> numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> Documentation/vm/numa:a default zonelist order based on the sizes of the
> various zone types relative
> Documentation/vm/numa:default zonelist order may be overridden using the
> numa_zonelist_order kernel
> 
> Otherwise,
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This?
---
commit 06c69d6785447160717ca8cc476dc9cac7a3f964
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Jul 19 14:28:31 2017 +0200

    fold me
    
    - update documentation as per Vlastimil
    
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 60530c8490ff..28f1a0f84456 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2724,7 +2724,7 @@
 			Allowed values are enable and disable
 
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
-			one of ['zone', 'node', 'default'] can be specified
+			'node', 'default' can be specified
 			This can be set from sysctl after boot.
 			See Documentation/sysctl/vm.txt for details.
 
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 48244c42ff52..9baf66a9ef4e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
 
 numa_zonelist_order
 
-This sysctl is only for NUMA.
+This sysctl is only for NUMA and it is deprecated. Anything but
+Node order will fail!
+
 'where the memory is allocated from' is controlled by zonelists.
 (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
  you may be able to read ZONE_DMA as ZONE_DMA32...)
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index a08f71647714..a31b85b9bb88 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
 fall back to the same zone type on a different node, or to a different zone
 type on the same node.  This is an important consideration because some zones,
 such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
-a default zonelist order based on the sizes of the various zone types relative
-to the total memory of the node and the total memory of the system.  The
-default zonelist order may be overridden using the numa_zonelist_order kernel
-boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
-Documentation/sysctl/vm.txt]
+a default Node ordered zonelist. This means it tries to fallback to other zones
+from the same node before using remote nodes which are ordered by NUMA distance.
 
 By default, Linux will attempt to satisfy memory allocation requests from the
 node to which the CPU that executes the request is assigned.  Specifically,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b3f6ce..b849006b20d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -895,8 +895,6 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
-extern char numa_zonelist_order[];
-#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node
  2017-07-14  8:00 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
  2017-07-14 12:14   ` Michal Hocko
@ 2017-07-20  6:13   ` Vlastimil Babka
  1 sibling, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-20  6:13 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko,
	Toshi Kani

On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> try_online_node calls hotadd_new_pgdat which already calls
> build_all_zonelists. So the additional call is redundant.  Even though
> hotadd_new_pgdat will only initialize zonelists of the new node this is
> the right thing to do because such a node doesn't have any memory so
> other zonelists would ignore all the zones from this node anyway.

Doesn't the "if (pgdat<...>zone == NULL) in fact mean, that this is just
always dead code? Even more reason to remove it.

> Cc: Toshi Kani <toshi.kani@hp.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/memory_hotplug.c | 7 -------
>  1 file changed, 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 639b8af37c45..0d2f6a11075c 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1104,13 +1104,6 @@ int try_online_node(int nid)
>  	node_set_online(nid);
>  	ret = register_one_node(nid);
>  	BUG_ON(ret);
> -
> -	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
> -		mutex_lock(&zonelists_mutex);
> -		build_all_zonelists(NULL);
> -		mutex_unlock(&zonelists_mutex);
> -	}
> -
>  out:
>  	mem_hotplug_done();
>  	return ret;
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14 11:45       ` Michal Hocko
@ 2017-07-20  6:16         ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-20  6:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML

On 07/14/2017 01:45 PM, Michal Hocko wrote:
> On Fri 14-07-17 13:43:21, Michal Hocko wrote:
>> On Fri 14-07-17 13:29:14, Vlastimil Babka wrote:
>>> On 07/14/2017 10:00 AM, Michal Hocko wrote:
>>>> From: Michal Hocko <mhocko@suse.com>
>>>>
>>>> build_all_zonelists has been (ab)using stop_machine to make sure that
>>>> zonelists do not change while somebody is looking at them. This is
>>>> just a gross hack because a) it complicates the context from which
>>>> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
>>>> switch locking to a percpu rwsem")) and b) it is not really necessary
>>>> especially after "mm, page_alloc: simplify zonelist initialization".
>>>>
>>>> Updates of the zonelists happen very seldom, basically only when a zone
>>>> becomes populated during memory online or when it loses all the memory
>>>> during offline. A racing iteration over zonelists could either miss a
>>>> zone or try to work on one zone twice. Both of these are something we
>>>> can live with occasionally because there will always be at least one
>>>> zone visible so we are not likely to fail allocation too easily for
>>>> example.
>>>
>>> Given the experience with cpusets and mempolicies, I would rather
>>> avoid the risk of allocation not seeing the only zone(s) that are
>>> allowed by its nodemask, and triggering premature OOM.
>>
>> I would argue, those are a different beast because they are directly
>> under the control of a not fully privileged user and change between the empty
>> nodemask and cpusets very often. For this one to trigger we
>> would have to online/offline the last memory block in the zone very
>> often and that doesn't resemble a sensible usecase even remotely.

OK.

>>> So maybe the
>>> updates could be done in a way to avoid that, e.g. first append a copy
>>> of the old zonelist to the end, then overwrite and terminate with NULL.
>>> But if this requires any barriers or something similar on the iteration
>>> site, which is performance critical, then it's bad.
>>> Maybe a seqcount, that the iteration side only starts checking in the
>>> slowpath? Like we have with cpusets now.
>>> I know that Mel noted that stop_machine() also never had such guarantees
>>> to prevent this, but it could have made the chances smaller.
>>
>> I think we can come up with some scheme but is this really worth it
>> considering how unlikely the whole thing is? Well, if somebody hits a
>> premature OOM killer or allocation failures it would have to be along
>> with heavy memory hotplug operations and then it would be quite easy
>> to spot what is going on and try to fix it. I would rather not
>> overcomplicate it, to be honest.

Fine, we can always add it later.
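
If we ever do, I would imagine something along these lines (a
hypothetical sketch only, nothing like this is part of the series):

	static seqcount_t zonelist_update_seq = SEQCNT_ZERO(zonelist_update_seq);

	/* updater (memory hotplug) side */
	write_seqcount_begin(&zonelist_update_seq);
	__build_all_zonelists(pgdat);
	write_seqcount_end(&zonelist_update_seq);

	/* allocator side, checked only in the slowpath */
	unsigned int seq = read_seqcount_begin(&zonelist_update_seq);
	/* ... iterate zonelists, try to allocate ... */
	if (read_seqcount_retry(&zonelist_update_seq, seq))
		goto retry;	/* zonelists changed under us, restart */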

> And one more thing, Mel has already brought this up in his response:
> stop_machine gives very roughly the same strength wrt. a double zone
> visit or a missed zone because we do not restart the zonelist iteration.

I know, that's why I wrote "I know that Mel noted that stop_machine()
also never had such guarantees to prevent this, but it could have made
the chances smaller." But I don't have any good proof that your patch is
indeed making things worse, so let's apply and see...


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-14  8:00 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
  2017-07-14  9:55   ` Mel Gorman
  2017-07-14 12:46   ` Mel Gorman
@ 2017-07-20  6:55   ` Vlastimil Babka
  2017-07-20  7:19     ` Michal Hocko
  2 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-20  6:55 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko

On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate the initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see a halfway initialized state but that
> should be much more tolerable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

I've collected the fold-ups from this thread and looked at the result as a
single patch. Seems OK, just two things:
- please rename variable "i" in build_zonelists() to e.g. "nr_nodes"
- the !CONFIG_NUMA variant of build_zonelists() won't build, because it
doesn't declare the nr_zones variable


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-20  6:55   ` Vlastimil Babka
@ 2017-07-20  7:19     ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-20  7:19 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML

On Thu 20-07-17 08:55:42, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > build_zonelists gradually builds zonelists from the nearest to the most
> > distant node. As we do not know how many populated zones we will have in
> > each node we rely on the _zoneref to terminate initialized part of the
> > zonelist by a NULL zone. While this is functionally correct it is quite
> > suboptimal because we cannot allow updaters to race with zonelists
> > users because they could see an empty zonelist and fail the allocation
> > or hit the OOM killer in the worst case.
> > 
> > We can do much better, though. We can store the node ordering into an
> > already existing node_order array and then give this array to
> > build_zonelists_in_node_order and do the whole initialization at once.
> > zonelists consumers still might see halfway initialized state but that
> > should be much more tolerateable because the list will not be empty and
> > they would either see some zone twice or skip over some zone(s) in the
> > worst case which shouldn't lead to immediate failures.
> > 
> > This patch alone doesn't introduce any functional change yet, though, it
> > is merely a preparatory work for later changes.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> I've collected the fold-ups from this thread and looked at the result as a
> single patch. Seems OK, just two things:
> - please rename variable "i" in build_zonelists() to e.g. "nr_nodes"
> - the !CONFIG_NUMA variant of build_zonelists() won't build, because it
> doesn't declare the nr_zones variable

Thanks! I will fold this in.
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0d3e8eeb150..6f192405e469 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4957,7 +4957,7 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	static int node_order[MAX_NUMNODES];
-	int node, load, i = 0;
+	int node, load, nr_nodes = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 
@@ -4978,12 +4978,12 @@ static void build_zonelists(pg_data_t *pgdat)
 		    node_distance(local_node, prev_node))
 			node_load[node] = load;
 
-		node_order[i++] = node;
+		node_order[nr_nodes++] = node;
 		prev_node = node;
 		load--;
 	}
 
-	build_zonelists_in_node_order(pgdat, node_order, i);
+	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
 }
 
@@ -5013,10 +5013,11 @@ static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
 	struct zoneref *zonerefs;
+	int nr_zones;
 
 	local_node = pgdat->node_id;
 
-	zonrefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
 	nr_zones = build_zonerefs_node(pgdat, zonerefs);
 	zonerefs += nr_zones;
 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-14  8:00 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
  2017-07-14  9:59   ` Mel Gorman
  2017-07-14 11:29   ` Vlastimil Babka
@ 2017-07-20  7:24   ` Vlastimil Babka
  2017-07-20  9:21     ` Michal Hocko
  2 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-20  7:24 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko

On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_all_zonelists has been (ab)using stop_machine to make sure that
> zonelists do not change while somebody is looking at them. This is
> just a gross hack because a) it complicates the context from which
> we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> switch locking to a percpu rwsem")) and b) it is not really necessary
> especially after "mm, page_alloc: simplify zonelist initialization".
> 
> Updates of the zonelists happen very seldom, basically only when a zone
> becomes populated during memory online or when it loses all the memory
> during offline. A racing iteration over zonelists could either miss a
> zone or try to work on one zone twice. Both of these are something we
> can live with occasionally because there will always be at least one
> zone visible so we are not likely to fail allocation too easily for
> example.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Some stress testing of this would still be worth it, IMHO.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78bd62418380..217889ecd13f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5066,8 +5066,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
>   */
>  DEFINE_MUTEX(zonelists_mutex);
>  
> -/* return values int ....just for stop_machine() */
> -static int __build_all_zonelists(void *data)
> +static void __build_all_zonelists(void *data)
>  {
>  	int nid;
>  	int cpu;
> @@ -5103,8 +5102,6 @@ static int __build_all_zonelists(void *data)
>  			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
>  #endif
>  	}
> -
> -	return 0;
>  }
>  
>  static noinline void __init
> @@ -5147,9 +5144,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>  	if (system_state == SYSTEM_BOOTING) {
>  		build_all_zonelists_init();
>  	} else {
> -		/* we have to stop all cpus to guarantee there is no user
> -		   of zonelist */
> -		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
> +		__build_all_zonelists(pgdat);
>  		/* cpuset refresh routine should be here */
>  	}
>  	vm_total_pages = nr_free_pagecache_pages();
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations
  2017-07-14  8:00 ` [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Michal Hocko
@ 2017-07-20  8:04   ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-20  8:04 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko,
	joonsoo kim, Shaohua Li

On 07/14/2017 10:00 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> f52407ce2dea ("memory hotplug: alloc page from other node in memory
> online") has introduced N_HIGH_MEMORY checks to only use NUMA aware
> allocations when there is some memory present because the respective
> node might not have any memory yet at the time and so it could fail
> or even OOM. Things have changed since then though. Zonelists are
> now always initialized before we do any allocations even for hotplug
> (see 959ecc48fc75 ("mm/memory_hotplug.c: fix building of node hotplug
> zonelist")). Therefore these checks are not really needed. In fact
> caller of the allocator should never care about whether the node is
> populated because that might change at any time.
> 
> Cc: Shaohua Li <shaohua.li@intel.com>
> Cc: joonsoo kim <js1304@gmail.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_ext.c       |  5 +----
>  mm/sparse-vmemmap.c | 11 +++--------
>  mm/sparse.c         | 10 +++-------
>  3 files changed, 7 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 88ccc044b09a..714ce79256c5 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -222,10 +222,7 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
>  		return addr;
>  	}
>  
> -	if (node_state(nid, N_HIGH_MEMORY))
> -		addr = vzalloc_node(size, nid);
> -	else
> -		addr = vzalloc(size);
> +	addr = vzalloc_node(size, nid);
>  
>  	return addr;
>  }
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index c50b1a14d55e..d1a39b8051e0 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -54,14 +54,9 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
>  	if (slab_is_available()) {
>  		struct page *page;
>  
> -		if (node_state(node, N_HIGH_MEMORY))
> -			page = alloc_pages_node(
> -				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> -				get_order(size));
> -		else
> -			page = alloc_pages(
> -				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> -				get_order(size));
> +		page = alloc_pages_node(node,
> +			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
> +			get_order(size));
>  		if (page)
>  			return page_address(page);
>  		return NULL;
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 7b4be3fd5cac..a9783acf2bb9 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -65,14 +65,10 @@ static noinline struct mem_section __ref *sparse_index_alloc(int nid)
>  	unsigned long array_size = SECTIONS_PER_ROOT *
>  				   sizeof(struct mem_section);
>  
> -	if (slab_is_available()) {
> -		if (node_state(nid, N_HIGH_MEMORY))
> -			section = kzalloc_node(array_size, GFP_KERNEL, nid);
> -		else
> -			section = kzalloc(array_size, GFP_KERNEL);
> -	} else {
> +	if (slab_is_available())
> +		section = kzalloc_node(array_size, GFP_KERNEL, nid);
> +	else
>  		section = memblock_virt_alloc_node(array_size, nid);
> -	}
>  
>  	return section;
>  }
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists
  2017-07-19 13:33   ` Vlastimil Babka
@ 2017-07-20  8:15     ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-20  8:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Wen Congyang

On Wed 19-07-17 15:33:32, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
[...]
> > @@ -5146,19 +5145,14 @@ build_all_zonelists_init(void)
> >   * unless system_state == SYSTEM_BOOTING.
> >   *
> >   * __ref due to (1) call of __meminit annotated setup_zone_pageset
> 
> Isn't the whole (1) in the comment invalid now?

Yeah, I will drop it.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-20  7:24   ` Vlastimil Babka
@ 2017-07-20  9:21     ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2017-07-20  9:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML

On Thu 20-07-17 09:24:58, Vlastimil Babka wrote:
> On 07/14/2017 10:00 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > build_all_zonelists has been (ab)using stop_machine to make sure that
> > zonelists do not change while somebody is looking at them. This is
> > just a gross hack because a) it complicates the context from which
> > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
> > switch locking to a percpu rwsem")) and b) it is not really necessary
> > especially after "mm, page_alloc: simplify zonelist initialization".
> > 
> > Updates of the zonelists happen very seldom, basically only when a zone
> > becomes populated during memory online or when it loses all the memory
> > during offline. A racing iteration over zonelists could either miss a
> > zone or try to work on one zone twice. Both of these are something we
> > can live with occasionally because there will always be at least one
> > zone visible, so we are unlikely to fail an allocation outright, for
> > example.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Some stress testing of this would still be worthwhile, IMHO.

I have run a pathological online/offline loop on a single memblock in the
movable zone while putting the same small node under memory pressure.
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 943, 943, 943)
Node 1, zone    DMA32
  pages free     227310
        min      8294
        low      10367
        high     12440
        spanned  262112
        present  262112
        managed  241436
        protection: (0, 0, 0, 0)
Node 1, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 1024)
Node 1, zone  Movable
  pages free     32722
        min      85
        low      117
        high     149
        spanned  32768
        present  32768
        managed  32768
        protection: (0, 0, 0, 0)

root@test1:/sys/devices/system/node/node1# while true
do 
echo offline > memory34/state
echo online_movable > memory34/state
done

root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4

and it survived without any unexpected behavior. While this is not
really great test coverage, it should exercise the allocation path
quite a lot.

I can add this to the changelog if you think it is worth it.
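
To make the "miss a zone or visit it twice" reasoning concrete, the
reader side is essentially the following lockless walk (a simplified
sketch of the zoneref iteration, not the literal mainline iterator;
the consumer function is hypothetical):

	struct zoneref *z;
	struct zone *zone;

	/*
	 * Readers walk the NULL-terminated zoneref array without any
	 * synchronization against updaters. A concurrent rebuild may
	 * make them visit a zone twice or skip one, but the array is
	 * never published empty, so the walk always finds a zone.
	 */
	for (z = zonelist->_zonerefs; (zone = z->zone) != NULL; z++)
		consider_zone_for_allocation(zone);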
 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-21 14:39 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
@ 2017-07-24  9:25   ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2017-07-24  9:25 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, linux-mm, LKML, Michal Hocko

On 07/21/2017 04:39 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zonerefs to terminate the initialized part of
> the zonelist with a NULL zone. While this is functionally correct it is
> quite suboptimal because we cannot allow updaters to race with zonelist
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into the
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> Zonelist consumers might still see a half-initialized state but that
> should be much more tolerable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case, which shouldn't lead to immediate failures.
> 
> While at it let's simplify build_zonelists_node which is rather
> confusing now. It gets an index into the zoneref array and returns
> the updated index for the next iteration. Let's rename the function
> to build_zonerefs_node to better reflect its purpose and give it the
> zoneref array to update. The function doesn't return the updated index
> anymore. It just returns the number of added zones so that the caller
> can advance the zoneref array start for the next update.
> 
> This patch alone doesn't introduce any functional change yet; it is
> merely preparatory work for later changes.
> 
> Changes since v1
> - build_zonelists_node -> build_zonerefs_node and operate directly on
>   the zonerefs array rather than playing tricks with an index into it.
> - give build_zonelists_in_node_order nr_nodes so that it does not
>   iterate over all MAX_NUMNODES, as per Mel
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-24  9:25   ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have in
each node we rely on the _zonerefs to terminate the initialized part of
the zonelist with a NULL zone. While this is functionally correct it is
quite suboptimal because we cannot allow updaters to race with zonelist
users because they could see an empty zonelist and fail the allocation
or hit the OOM killer in the worst case.

We can do much better, though. We can store the node ordering into the
already existing node_order array and then give this array to
build_zonelists_in_node_order and do the whole initialization at once.
Zonelist consumers might still see a half-initialized state but that
should be much more tolerable because the list will not be empty and
they would either see some zone twice or skip over some zone(s) in the
worst case, which shouldn't lead to immediate failures.
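
As an illustration (not part of the patch itself), the difference for a
racing reader boils down to the order of stores into the zoneref array;
the two helpers in the "before" half are made-up shorthands:

	/* before: the list is reset first and refilled node by node */
	zonelist->_zonerefs[0].zone = NULL;	/* reader can observe an empty list */
	zonelist->_zonerefs[0].zone_idx = 0;
	for_each_node_in_distance_order(node)
		append_node_zonerefs(zonelist, node);

	/* after: one pass over the precomputed order, terminator written last */
	for (i = 0; i < nr_nodes; i++)
		zonerefs += build_zonerefs_node(NODE_DATA(node_order[i]), zonerefs);
	zonerefs->zone = NULL;
	zonerefs->zone_idx = 0;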

While at it let's simplify build_zonelists_node which is rather
confusing now. It gets an index into the zoneref array and returns
the updated index for the next iteration. Let's rename the function
to build_zonerefs_node to better reflect its purpose and give it the
zoneref array to update. The function doesn't return the updated index
anymore. It just returns the number of added zones so that the caller
can advance the zoneref array start for the next update.

This patch alone doesn't introduce any functional change yet; it is
merely preparatory work for later changes.

Changes since v1
- build_zonelists_node -> build_zonerefs_node and operate directly on
  the zonerefs array rather than playing tricks with an index into it.
- give build_zonelists_in_node_order nr_nodes so that it does not
  iterate over all MAX_NUMNODES, as per Mel

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 81 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 41 insertions(+), 40 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e9aca464f66..0d78dc5a708f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
-				int nr_zones)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 {
 	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
+	int nr_zones = 0;
 
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (managed_zone(zone)) {
-			zoneref_set_zone(zone,
-				&zonelist->_zonerefs[nr_zones++]);
+			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
@@ -4913,17 +4912,24 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+		unsigned nr_nodes)
 {
-	int j;
-	struct zonelist *zonelist;
+	struct zoneref *zonerefs;
+	int i;
+
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+
+	for (i = 0; i < nr_nodes; i++) {
+		int nr_zones;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
-		;
-	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+		pg_data_t *node = NODE_DATA(node_order[i]);
+
+		nr_zones = build_zonerefs_node(node, zonerefs);
+		zonerefs += nr_zones;
+	}
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4931,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-	int j;
-	struct zonelist *zonelist;
+	struct zoneref *zonerefs;
+	int nr_zones;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK];
-	j = build_zonelists_node(pgdat, zonelist, 0);
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+	zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs);
+	zonerefs += nr_zones;
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4946,21 +4953,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  * exhausted, but results in overflowing to remote node while memory
  * may still exist in local DMA zone.
  */
-static int node_order[MAX_NUMNODES];
 
 static void build_zonelists(pg_data_t *pgdat)
 {
-	int i, node, load;
+	static int node_order[MAX_NUMNODES];
+	int node, load, nr_nodes = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
-	struct zonelist *zonelist;
-
-	/* initialize zonelists */
-	for (i = 0; i < MAX_ZONELISTS; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->_zonerefs[0].zone = NULL;
-		zonelist->_zonerefs[0].zone_idx = 0;
-	}
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -4969,8 +4968,6 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	i = 0;
-
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
 		/*
 		 * We don't want to pressure a particular node.
@@ -4981,11 +4978,12 @@ static void build_zonelists(pg_data_t *pgdat)
 		    node_distance(local_node, prev_node))
 			node_load[node] = load;
 
+		node_order[nr_nodes++] = node;
 		prev_node = node;
 		load--;
-		build_zonelists_in_node_order(pgdat, node);
 	}
 
+	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
 }
 
@@ -5014,13 +5012,14 @@ static void setup_min_slab_ratio(void);
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	enum zone_type j;
-	struct zonelist *zonelist;
+	struct zoneref *zonerefs;
+	int nr_zones;
 
 	local_node = pgdat->node_id;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	j = build_zonelists_node(pgdat, zonelist, 0);
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs);
+	zonerefs += nr_zones;
 
 	/*
 	 * Now we build the zonelist so that it contains the zones
@@ -5033,16 +5032,18 @@ static void build_zonelists(pg_data_t *pgdat)
 	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
 		if (!node_online(node))
 			continue;
-		j = build_zonelists_node(NODE_DATA(node), zonelist, j);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		zonerefs += nr_zones;
 	}
 	for (node = 0; node < local_node; node++) {
 		if (!node_online(node))
 			continue;
-		j = build_zonelists_node(NODE_DATA(node), zonelist, j);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		zonerefs += nr_zones;
 	}
 
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 #endif	/* CONFIG_NUMA */
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2017-07-24  9:25 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
2017-07-14  9:36   ` Mel Gorman
2017-07-14 10:47     ` Michal Hocko
2017-07-14 11:16       ` Mel Gorman
2017-07-14 11:38         ` Michal Hocko
2017-07-14 12:56           ` Mel Gorman
2017-07-14 13:01             ` Mel Gorman
2017-07-14 13:08             ` Michal Hocko
2017-07-19  9:33   ` Vlastimil Babka
2017-07-19 13:44     ` Michal Hocko
2017-07-14  7:59 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
2017-07-14  9:39   ` Mel Gorman
2017-07-19 13:15   ` Vlastimil Babka
2017-07-14  8:00 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
2017-07-14  9:48   ` Mel Gorman
2017-07-14 10:50     ` Michal Hocko
2017-07-14 12:32       ` Mel Gorman
2017-07-14 12:39         ` Michal Hocko
2017-07-14 12:56           ` Mel Gorman
2017-07-19 13:19   ` Vlastimil Babka
2017-07-14  8:00 ` [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists Michal Hocko
2017-07-19 13:33   ` Vlastimil Babka
2017-07-20  8:15     ` Michal Hocko
2017-07-14  8:00 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
2017-07-14 12:14   ` Michal Hocko
2017-07-20  6:13   ` Vlastimil Babka
2017-07-14  8:00 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
2017-07-14  9:55   ` Mel Gorman
2017-07-14 10:51     ` Michal Hocko
2017-07-14 12:46   ` Mel Gorman
2017-07-14 13:02     ` Michal Hocko
2017-07-14 14:18       ` Mel Gorman
2017-07-17  6:06         ` Michal Hocko
2017-07-17  8:07           ` Mel Gorman
2017-07-17  8:19             ` Michal Hocko
2017-07-17  8:58               ` Mel Gorman
2017-07-17  9:15                 ` Michal Hocko
2017-07-20  6:55   ` Vlastimil Babka
2017-07-20  7:19     ` Michal Hocko
2017-07-14  8:00 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
2017-07-14  9:59   ` Mel Gorman
2017-07-14 11:00     ` Michal Hocko
2017-07-14 12:47       ` Mel Gorman
2017-07-14 11:29   ` Vlastimil Babka
2017-07-14 11:43     ` Michal Hocko
2017-07-14 11:45       ` Michal Hocko
2017-07-20  6:16         ` Vlastimil Babka
2017-07-20  7:24   ` Vlastimil Babka
2017-07-20  9:21     ` Michal Hocko
2017-07-14  8:00 ` [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex Michal Hocko
2017-07-14  8:00 ` [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Michal Hocko
2017-07-20  8:04   ` Vlastimil Babka
2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
2017-07-21 14:39 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
2017-07-24  9:25   ` Vlastimil Babka
