Linux-mm Archive on lore.kernel.org
* [PATCH -v1 0/9] cleanup zonelists initialization
@ 2017-07-21 14:39 Michal Hocko
  2017-07-21 14:39 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Joonsoo Kim, Michal Hocko, Shaohua Li, Toshi Kani

Hi,
The previous version of the series was posted here [1]. I have hopefully
addressed all review feedback. There are some small fixes/cleanups
here and there but no fundamental changes since the last posting.

This series is aimed at cleaning up the zonelists initialization code we
have. The primary motivation was bug report [2]; it got resolved, but
the usage of stop_machine is just too ugly to live. Most patches are
straightforward but 3 of them need special consideration.

Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
because this is a user visible change. As I argue in the patch
description I do not think we have a strong usecase for it these days.
I have kept sysctl in place and warn into the log if somebody tries to
configure zone lists ordering. If somebody has a real usecase for it
we can revert this patch but I do not expect anybody will actually notice
runtime differences. This patch is not strictly needed for the rest but
it made patch 6 easier to implement.

Patch 7 removes stop_machine from build_all_zonelists without adding any
special synchronization between iterators and updater which I _believe_
is acceptable as explained in the changelog. I hope I am not missing
anything.

Patch 8 then removes zonelists_mutex which is kind of ugly as well and
not really needed AFAICS, but care should be taken when double checking
my thinking.

This has passed my light testing but I currently do not have HW to
test the hotadd_new_pgdat path (aka a completely new node added to the
system at runtime).

This is based on the current mmotm git tree (mmotm-2017-07-12-15-11).
Any feedback is highly appreciated.

The diffstat looks really promising:
 Documentation/admin-guide/kernel-parameters.txt |   2 +-
 Documentation/sysctl/vm.txt                     |   4 +-
 Documentation/vm/numa                           |   7 +-
 include/linux/mmzone.h                          |   5 +-
 init/main.c                                     |   2 +-
 kernel/sysctl.c                                 |   2 -
 mm/internal.h                                   |   1 +
 mm/memory_hotplug.c                             |  27 +-
 mm/page_alloc.c                                 | 335 +++++++-----------------
 mm/page_ext.c                                   |   5 +-
 mm/sparse-vmemmap.c                             |  11 +-
 mm/sparse.c                                     |  10 +-
 12 files changed, 119 insertions(+), 292 deletions(-)

Shortlog says
Michal Hocko (9):
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE
      mm, page_alloc: remove boot pageset initialization from memory hotplug
      mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
      mm, memory_hotplug: drop zone from build_all_zonelists
      mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node
      mm, page_alloc: simplify zonelist initialization
      mm, page_alloc: remove stop_machine from build_all_zonelists
      mm, memory_hotplug: get rid of zonelists_mutex
      mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations

[1] http://lkml.kernel.org/r/20170714080006.7250-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/alpine.DEB.2.20.1706291803380.1861@nanos

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-21 14:45   ` Michal Hocko
  2017-07-21 14:39 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Supporting zone ordered zonelists costs us a lot of code while
the usefulness is arguable, if existent at all. Mel has already made
node ordering the default on 64b systems. 32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fall back to
a different NUMA node rather than consume precious lowmem zones.

This argument is, however, weakened by the fact that the memory reclaim
has been reworked to be node rather than zone oriented. This means
that lowmem requests have to skip over all highmem pages on LRUs already
and so zone ordering doesn't save much reclaim time. So the only
advantage of the zone ordering is under light memory pressure, when
highmem requests never hit lowmem zones and the lowmem
pressure doesn't need to reclaim.

Considering that 32b NUMA systems are rather suboptimal already and
it is generally advisable to use a 64b kernel on such HW, I believe we
should rather care about the code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
somebody tries to set zone ordering either from the kernel command line
or the sysctl.

Changes since v1
- do not print Default in the sysctl handler because our behavior was
  rather inconsistent in the past: numa_zonelist_order was lowercase
  while zonelist_order_name was uppercase, so a boot time unchanged value
  would print lowercase while an updated value could be uppercase. Print
  "Node" which is the default instead - Mel
- update documentation - Vlastimil

Cc: linux-api@vger.kernel.org
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   2 +-
 Documentation/sysctl/vm.txt                     |   4 +-
 Documentation/vm/numa                           |   7 +-
 include/linux/mmzone.h                          |   2 -
 kernel/sysctl.c                                 |   2 -
 mm/page_alloc.c                                 | 178 +++---------------------
 6 files changed, 29 insertions(+), 166 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 60530c8490ff..28f1a0f84456 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2724,7 +2724,7 @@
 			Allowed values are enable and disable
 
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
-			one of ['zone', 'node', 'default'] can be specified
+			'node', 'default' can be specified
 			This can be set from sysctl after boot.
 			See Documentation/sysctl/vm.txt for details.
 
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 48244c42ff52..9baf66a9ef4e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
 
 numa_zonelist_order
 
-This sysctl is only for NUMA.
+This sysctl is only for NUMA and it is deprecated. Anything but
+Node order will fail!
+
 'where the memory is allocated from' is controlled by zonelists.
 (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
  you may be able to read ZONE_DMA as ZONE_DMA32...)
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index a08f71647714..a31b85b9bb88 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
 fall back to the same zone type on a different node, or to a different zone
 type on the same node.  This is an important consideration because some zones,
 such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
-a default zonelist order based on the sizes of the various zone types relative
-to the total memory of the node and the total memory of the system.  The
-default zonelist order may be overridden using the numa_zonelist_order kernel
-boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
-Documentation/sysctl/vm.txt]
+a default Node ordered zonelist. This means it tries to fallback to other zones
+from the same node before using remote nodes which are ordered by NUMA distance.
 
 By default, Linux will attempt to satisfy memory allocation requests from the
 node to which the CPU that executes the request is assigned.  Specifically,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b3f6ce..b849006b20d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -895,8 +895,6 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
-extern char numa_zonelist_order[];
-#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 655686d546cb..0cbce40f5426 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1553,8 +1553,6 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "numa_zonelist_order",
-		.data		= &numa_zonelist_order,
-		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
 		.mode		= 0644,
 		.proc_handler	= numa_zonelist_order_handler,
 	},
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..b839a90e24c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4791,52 +4791,18 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	return nr_zones;
 }
 
-
-/*
- *  zonelist_order:
- *  0 = automatic detection of better ordering.
- *  1 = order by ([node] distance, -zonetype)
- *  2 = order by (-zonetype, [node] distance)
- *
- *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
- *  the same zonelist. So only NUMA can configure this param.
- */
-#define ZONELIST_ORDER_DEFAULT  0
-#define ZONELIST_ORDER_NODE     1
-#define ZONELIST_ORDER_ZONE     2
-
-/* zonelist order in the kernel.
- * set_zonelist_order() will set this to NODE or ZONE.
- */
-static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
-static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
-
-
 #ifdef CONFIG_NUMA
-/* The value user specified ....changed by config */
-static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-/* string for sysctl */
-#define NUMA_ZONELIST_ORDER_LEN	16
-char numa_zonelist_order[16] = "default";
-
-/*
- * interface for configure zonelist ordering.
- * command line option "numa_zonelist_order"
- *	= "[dD]efault	- default, automatic configuration.
- *	= "[nN]ode 	- order by node locality, then by zone within node
- *	= "[zZ]one      - order by zone, then by locality within zone
- */
 
 static int __parse_numa_zonelist_order(char *s)
 {
-	if (*s == 'd' || *s == 'D') {
-		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-	} else if (*s == 'n' || *s == 'N') {
-		user_zonelist_order = ZONELIST_ORDER_NODE;
-	} else if (*s == 'z' || *s == 'Z') {
-		user_zonelist_order = ZONELIST_ORDER_ZONE;
-	} else {
-		pr_warn("Ignoring invalid numa_zonelist_order value:  %s\n", s);
+	/*
+	 * We used to support different zonlists modes but they turned
+	 * out to be just not useful. Let's keep the warning in place
+	 * if somebody still use the cmd line parameter so that we do
+	 * not fail it silently
+	 */
+	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
+		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
 		return -EINVAL;
 	}
 	return 0;
@@ -4844,16 +4810,10 @@ static int __parse_numa_zonelist_order(char *s)
 
 static __init int setup_numa_zonelist_order(char *s)
 {
-	int ret;
-
 	if (!s)
 		return 0;
 
-	ret = __parse_numa_zonelist_order(s);
-	if (ret == 0)
-		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
-
-	return ret;
+	return __parse_numa_zonelist_order(s);
 }
 early_param("numa_zonelist_order", setup_numa_zonelist_order);
 
@@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos)
 {
-	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	char *str;
 	int ret;
-	static DEFINE_MUTEX(zl_order_mutex);
 
-	mutex_lock(&zl_order_mutex);
-	if (write) {
-		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
-			ret = -EINVAL;
-			goto out;
-		}
-		strcpy(saved_string, (char *)table->data);
+	if (!write) {
+		int len = sizeof("Node");
+		if (copy_to_user(buffer, "Node", len))
+			return -EFAULT;
+		return len;
 	}
-	ret = proc_dostring(table, write, buffer, length, ppos);
-	if (ret)
-		goto out;
-	if (write) {
-		int oldval = user_zonelist_order;
 
-		ret = __parse_numa_zonelist_order((char *)table->data);
-		if (ret) {
-			/*
-			 * bogus value.  restore saved string
-			 */
-			strncpy((char *)table->data, saved_string,
-				NUMA_ZONELIST_ORDER_LEN);
-			user_zonelist_order = oldval;
-		} else if (oldval != user_zonelist_order) {
-			mutex_lock(&zonelists_mutex);
-			build_all_zonelists(NULL, NULL);
-			mutex_unlock(&zonelists_mutex);
-		}
-	}
-out:
-	mutex_unlock(&zl_order_mutex);
+	str = memdup_user_nul(buffer, 16);
+	if (IS_ERR(str))
+		return PTR_ERR(str);
+
+	ret = __parse_numa_zonelist_order(str);
+	kfree(str);
 	return ret;
 }
 
@@ -5006,70 +4948,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  */
 static int node_order[MAX_NUMNODES];
 
-static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
-{
-	int pos, j, node;
-	int zone_type;		/* needs to be signed */
-	struct zone *z;
-	struct zonelist *zonelist;
-
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	pos = 0;
-	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
-		for (j = 0; j < nr_nodes; j++) {
-			node = node_order[j];
-			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (managed_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
-				check_highest_zone(zone_type);
-			}
-		}
-	}
-	zonelist->_zonerefs[pos].zone = NULL;
-	zonelist->_zonerefs[pos].zone_idx = 0;
-}
-
-#if defined(CONFIG_64BIT)
-/*
- * Devices that require DMA32/DMA are relatively rare and do not justify a
- * penalty to every machine in case the specialised case applies. Default
- * to Node-ordering on 64-bit NUMA machines
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_NODE;
-}
-#else
-/*
- * On 32-bit, the Normal zone needs to be preserved for allocations accessible
- * by the kernel. If processes running on node 0 deplete the low memory zone
- * then reclaim will occur more frequency increasing stalls and potentially
- * be easier to OOM if a large percentage of the zone is under writeback or
- * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
- * Hence, default to zone ordering on 32-bit.
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_ZONE;
-}
-#endif /* CONFIG_64BIT */
-
-static void set_zonelist_order(void)
-{
-	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
-		current_zonelist_order = default_zonelist_order();
-	else
-		current_zonelist_order = user_zonelist_order;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int i, node, load;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -5099,15 +4983,7 @@ static void build_zonelists(pg_data_t *pgdat)
 
 		prev_node = node;
 		load--;
-		if (order == ZONELIST_ORDER_NODE)
-			build_zonelists_in_node_order(pgdat, node);
-		else
-			node_order[i++] = node;	/* remember order */
-	}
-
-	if (order == ZONELIST_ORDER_ZONE) {
-		/* calculate node order -- i.e., DMA last! */
-		build_zonelists_in_zone_order(pgdat, i);
+		build_zonelists_in_node_order(pgdat, node);
 	}
 
 	build_thisnode_zonelists(pgdat);
@@ -5135,11 +5011,6 @@ static void setup_min_unmapped_ratio(void);
 static void setup_min_slab_ratio(void);
 #else	/* CONFIG_NUMA */
 
-static void set_zonelist_order(void)
-{
-	current_zonelist_order = ZONELIST_ORDER_ZONE;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -5279,8 +5150,6 @@ build_all_zonelists_init(void)
  */
 void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 {
-	set_zonelist_order();
-
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
@@ -5306,9 +5175,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	pr_info("Built %i zonelists in %s order, mobility grouping %s.  Total pages: %ld\n",
+	pr_info("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
 		nr_online_nodes,
-		zonelist_order_name[current_zonelist_order],
 		page_group_by_mobility_disabled ? "off" : "on",
 		vm_total_pages);
 #ifdef CONFIG_NUMA
-- 
2.11.0


* [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
  2017-07-21 14:39 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-21 14:39 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

boot_pageset is a boot time hack which gets superseded by normal
pagesets later in the boot process. It makes zero sense to reinitialize
it again and again during memory hotplug.

Changes since v1
- use __maybe_unused for cpu to silence annoying warning for
  CONFIG_HAVE_MEMORYLESS_NODES=y

Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b839a90e24c8..ff53f709c44c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5077,7 +5077,7 @@ DEFINE_MUTEX(zonelists_mutex);
 static int __build_all_zonelists(void *data)
 {
 	int nid;
-	int cpu;
+	int __maybe_unused cpu;
 	pg_data_t *self = data;
 
 #ifdef CONFIG_NUMA
@@ -5098,23 +5098,8 @@ static int __build_all_zonelists(void *data)
 		}
 	}
 
-	/*
-	 * Initialize the boot_pagesets that are going to be used
-	 * for bootstrapping processors. The real pagesets for
-	 * each zone will be allocated later when the per cpu
-	 * allocator is available.
-	 *
-	 * boot_pagesets are used also for bootstrapping offline
-	 * cpus if the system is already booted because the pagesets
-	 * are needed to initialize allocators on a specific cpu too.
-	 * F.e. the percpu allocator needs the page allocator which
-	 * needs the percpu allocator in order to allocate its pagesets
-	 * (a chicken-egg dilemma).
-	 */
-	for_each_possible_cpu(cpu) {
-		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
-
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
+	for_each_possible_cpu(cpu) {
 		/*
 		 * We now know the "local memory node" for each node--
 		 * i.e., the node of the first zone in the generic zonelist.
@@ -5125,8 +5110,8 @@ static int __build_all_zonelists(void *data)
 		 */
 		if (cpu_online(cpu))
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
-#endif
 	}
+#endif
 
 	return 0;
 }
@@ -5134,7 +5119,26 @@ static int __build_all_zonelists(void *data)
 static noinline void __init
 build_all_zonelists_init(void)
 {
+	int cpu;
+
 	__build_all_zonelists(NULL);
+
+	/*
+	 * Initialize the boot_pagesets that are going to be used
+	 * for bootstrapping processors. The real pagesets for
+	 * each zone will be allocated later when the per cpu
+	 * allocator is available.
+	 *
+	 * boot_pagesets are used also for bootstrapping offline
+	 * cpus if the system is already booted because the pagesets
+	 * are needed to initialize allocators on a specific cpu too.
+	 * F.e. the percpu allocator needs the page allocator which
+	 * needs the percpu allocator in order to allocate its pagesets
+	 * (a chicken-egg dilemma).
+	 */
+	for_each_possible_cpu(cpu)
+		setup_pageset(&per_cpu(boot_pageset, cpu), 0);
+
 	mminit_verify_zonelist();
 	cpuset_init_current_mems_allowed();
 }
-- 
2.11.0


* [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
  2017-07-21 14:39 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
  2017-07-21 14:39 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-21 14:39 ` [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists Michal Hocko
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__build_all_zonelists reinitializes each online cpu's local node for
CONFIG_HAVE_MEMORYLESS_NODES. This makes sense because previously
memoryless nodes could gain some memory during memory hotplug and so the
local node should be changed for CPUs close to such a node. It makes
less sense to do that unconditionally for a newly created NUMA node
which is still offline and without any memory.

Let's also simplify the cpu loop and use for_each_online_cpu instead of
an explicit cpu_online check for all possible cpus.

Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff53f709c44c..52dc5ba40914 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5096,10 +5096,8 @@ static int __build_all_zonelists(void *data)
 
 			build_zonelists(pgdat);
 		}
-	}
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
-	for_each_possible_cpu(cpu) {
 		/*
 		 * We now know the "local memory node" for each node--
 		 * i.e., the node of the first zone in the generic zonelist.
@@ -5108,10 +5106,10 @@ static int __build_all_zonelists(void *data)
 		 * secondary cpus' numa_mem as they come on-line.  During
 		 * node/memory hotplug, we'll fixup all on-line cpus.
 		 */
-		if (cpu_online(cpu))
+		for_each_online_cpu(cpu)
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
-	}
 #endif
+	}
 
 	return 0;
 }
-- 
2.11.0


* [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
                   ` (2 preceding siblings ...)
  2017-07-21 14:39 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-21 14:39 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

build_all_zonelists gets a zone parameter to initialize zone's
pagesets. There is only a single user which gives a non-NULL
zone parameter and that one doesn't really need the rest of the
build_all_zonelists (see 6dcd73d7011b ("memory-hotplug: allocate zone's
pcp before onlining pages")).

Therefore remove setup_zone_pageset from build_all_zonelists and call it
from its only user directly. This will also remove a pointless zonelists
rebuilding which is always good.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  2 +-
 init/main.c            |  2 +-
 mm/internal.h          |  1 +
 mm/memory_hotplug.c    | 10 +++++-----
 mm/page_alloc.c        | 13 +++----------
 5 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b849006b20d3..0add5ee09729 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -771,7 +771,7 @@ static inline bool is_dev_zone(const struct zone *zone)
 #include <linux/memory_hotplug.h>
 
 extern struct mutex zonelists_mutex;
-void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
+void build_all_zonelists(pg_data_t *pgdat);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			 int classzone_idx, unsigned int alloc_flags,
diff --git a/init/main.c b/init/main.c
index f866510472d7..e1baf6544d15 100644
--- a/init/main.c
+++ b/init/main.c
@@ -519,7 +519,7 @@ asmlinkage __visible void __init start_kernel(void)
 	boot_cpu_state_init();
 	smp_prepare_boot_cpu();	/* arch-specific boot-cpu hooks */
 
-	build_all_zonelists(NULL, NULL);
+	build_all_zonelists(NULL);
 	page_alloc_init();
 
 	pr_notice("Kernel command line: %s\n", boot_command_line);
diff --git a/mm/internal.h b/mm/internal.h
index 24d88f084705..19d1f35aa6d7 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -522,4 +522,5 @@ static inline bool is_migrate_highatomic_page(struct page *page)
 	return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
 }
 
+void setup_zone_pageset(struct zone *zone);
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d620d0427b6b..639b8af37c45 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -952,7 +952,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	mutex_lock(&zonelists_mutex);
 	if (!populated_zone(zone)) {
 		need_zonelists_rebuild = 1;
-		build_all_zonelists(NULL, zone);
+		setup_zone_pageset(zone);
 	}
 
 	ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
@@ -973,7 +973,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	if (onlined_pages) {
 		node_states_set_node(nid, &arg);
 		if (need_zonelists_rebuild)
-			build_all_zonelists(NULL, NULL);
+			build_all_zonelists(NULL);
 		else
 			zone_pcp_update(zone);
 	}
@@ -1051,7 +1051,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	 * to access not-initialized zonelist, build here.
 	 */
 	mutex_lock(&zonelists_mutex);
-	build_all_zonelists(pgdat, NULL);
+	build_all_zonelists(pgdat);
 	mutex_unlock(&zonelists_mutex);
 
 	/*
@@ -1107,7 +1107,7 @@ int try_online_node(int nid)
 
 	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL);
 		mutex_unlock(&zonelists_mutex);
 	}
 
@@ -1727,7 +1727,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
 		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL);
 		mutex_unlock(&zonelists_mutex);
 	} else
 		zone_pcp_update(zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 52dc5ba40914..6e9aca464f66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5065,7 +5065,6 @@ static void build_zonelists(pg_data_t *pgdat)
 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
-static void setup_zone_pageset(struct zone *zone);
 
 /*
  * Global mutex to protect against size modification of zonelists
@@ -5145,20 +5144,14 @@ build_all_zonelists_init(void)
  * Called with zonelists_mutex held always
  * unless system_state == SYSTEM_BOOTING.
  *
- * __ref due to (1) call of __meminit annotated setup_zone_pageset
- * [we're only called with non-NULL zone through __meminit paths] and
- * (2) call of __init annotated helper build_all_zonelists_init
+ * __ref due to call of __init annotated helper build_all_zonelists_init
  * [protected by SYSTEM_BOOTING].
  */
-void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
+void __ref build_all_zonelists(pg_data_t *pgdat)
 {
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
-#ifdef CONFIG_MEMORY_HOTPLUG
-		if (zone)
-			setup_zone_pageset(zone);
-#endif
 		/* we have to stop all cpus to guarantee there is no user
 		   of zonelist */
 		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
@@ -5432,7 +5425,7 @@ static void __meminit zone_pageset_init(struct zone *zone, int cpu)
 	pageset_set_high_and_batch(zone, pcp);
 }
 
-static void __meminit setup_zone_pageset(struct zone *zone)
+void __meminit setup_zone_pageset(struct zone *zone)
 {
 	int cpu;
 	zone->pageset = alloc_percpu(struct per_cpu_pageset);
-- 
2.11.0


* [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
                   ` (3 preceding siblings ...)
  2017-07-21 14:39 ` [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-21 14:39 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko, Toshi Kani

From: Michal Hocko <mhocko@suse.com>

try_online_node calls hotadd_new_pgdat which already calls
build_all_zonelists. So the additional call is redundant.  Even though
hotadd_new_pgdat will only initialize zonelists of the new node this is
the right thing to do because such a node doesn't have any memory so
other zonelists would ignore all the zones from this node anyway.

Cc: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/memory_hotplug.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 639b8af37c45..0d2f6a11075c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1104,13 +1104,6 @@ int try_online_node(int nid)
 	node_set_online(nid);
 	ret = register_one_node(nid);
 	BUG_ON(ret);
-
-	if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
-		mutex_lock(&zonelists_mutex);
-		build_all_zonelists(NULL);
-		mutex_unlock(&zonelists_mutex);
-	}
-
 out:
 	mem_hotplug_done();
 	return ret;
-- 
2.11.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
                   ` (4 preceding siblings ...)
  2017-07-21 14:39 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-24  9:25   ` Vlastimil Babka
  2017-07-21 14:39 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

build_zonelists gradually builds zonelists from the nearest to the most
distant node. As we do not know how many populated zones we will have in
each node, we rely on the _zonerefs to terminate the initialized part of
the zonelist with a NULL zone. While this is functionally correct, it is
quite suboptimal because we cannot allow updaters to race with zonelist
users: they could see an empty zonelist and fail the allocation or, in
the worst case, hit the OOM killer.

We can do much better, though. We can store the node ordering into the
already existing node_order array and then hand this array to
build_zonelists_in_node_order to do the whole initialization at once.
Zonelist consumers might still see a halfway initialized state, but that
should be much more tolerable because the list will not be empty; in the
worst case they would either see some zone twice or skip over some
zone(s), which shouldn't lead to immediate failures.

While at it, let's simplify build_zonelists_node, which is rather
confusing now: it gets an index into the zoneref array and returns
the updated index for the next iteration. Let's rename the function
to build_zonerefs_node to better reflect its purpose and give it a
zoneref array to update. The function doesn't need the index anymore;
it just returns the number of added zones so that the caller can
advance the zoneref array start for the next update.

This patch alone doesn't introduce any functional change yet; it is
merely preparatory work for later changes.

Changes since v1
- build_zonelists_node -> build_zonerefs_node and operate directly on
  zonerefs array rather than play tricks with index into the array.
- give build_zonelists_in_node_order nr_nodes to not iterate over all
  MAX_NUMNODES as per Mel

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 81 +++++++++++++++++++++++++++++----------------------------
 1 file changed, 41 insertions(+), 40 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e9aca464f66..0d78dc5a708f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4772,18 +4772,17 @@ static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref)
  *
  * Add all populated zones of a node to the zonelist.
  */
-static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
-				int nr_zones)
+static int build_zonerefs_node(pg_data_t *pgdat, struct zoneref *zonerefs)
 {
 	struct zone *zone;
 	enum zone_type zone_type = MAX_NR_ZONES;
+	int nr_zones = 0;
 
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (managed_zone(zone)) {
-			zoneref_set_zone(zone,
-				&zonelist->_zonerefs[nr_zones++]);
+			zoneref_set_zone(zone, &zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
 	} while (zone_type);
@@ -4913,17 +4912,24 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
  * This results in maximum locality--normal zone overflows into local
  * DMA zone, if any--but risks exhausting DMA zone.
  */
-static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
+static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order,
+		unsigned nr_nodes)
 {
-	int j;
-	struct zonelist *zonelist;
+	struct zoneref *zonerefs;
+	int i;
+
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+
+	for (i = 0; i < nr_nodes; i++) {
+		int nr_zones;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
-		;
-	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+		pg_data_t *node = NODE_DATA(node_order[i]);
+
+		nr_zones = build_zonerefs_node(node, zonerefs);
+		zonerefs += nr_zones;
+	}
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4931,13 +4937,14 @@ static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-	int j;
-	struct zonelist *zonelist;
+	struct zoneref *zonerefs;
+	int nr_zones;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_NOFALLBACK];
-	j = build_zonelists_node(pgdat, zonelist, 0);
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+	zonerefs = pgdat->node_zonelists[ZONELIST_NOFALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs);
+	zonerefs += nr_zones;
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 /*
@@ -4946,21 +4953,13 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  * exhausted, but results in overflowing to remote node while memory
  * may still exist in local DMA zone.
  */
-static int node_order[MAX_NUMNODES];
 
 static void build_zonelists(pg_data_t *pgdat)
 {
-	int i, node, load;
+	static int node_order[MAX_NUMNODES];
+	int node, load, nr_nodes = 0;
 	nodemask_t used_mask;
 	int local_node, prev_node;
-	struct zonelist *zonelist;
-
-	/* initialize zonelists */
-	for (i = 0; i < MAX_ZONELISTS; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->_zonerefs[0].zone = NULL;
-		zonelist->_zonerefs[0].zone_idx = 0;
-	}
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -4969,8 +4968,6 @@ static void build_zonelists(pg_data_t *pgdat)
 	nodes_clear(used_mask);
 
 	memset(node_order, 0, sizeof(node_order));
-	i = 0;
-
 	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
 		/*
 		 * We don't want to pressure a particular node.
@@ -4981,11 +4978,12 @@ static void build_zonelists(pg_data_t *pgdat)
 		    node_distance(local_node, prev_node))
 			node_load[node] = load;
 
+		node_order[nr_nodes++] = node;
 		prev_node = node;
 		load--;
-		build_zonelists_in_node_order(pgdat, node);
 	}
 
+	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
 }
 
@@ -5014,13 +5012,14 @@ static void setup_min_slab_ratio(void);
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	enum zone_type j;
-	struct zonelist *zonelist;
+	struct zoneref *zonerefs;
+	int nr_zones;
 
 	local_node = pgdat->node_id;
 
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	j = build_zonelists_node(pgdat, zonelist, 0);
+	zonerefs = pgdat->node_zonelists[ZONELIST_FALLBACK]._zonerefs;
+	nr_zones = build_zonerefs_node(pgdat, zonerefs);
+	zonerefs += nr_zones;
 
 	/*
 	 * Now we build the zonelist so that it contains the zones
@@ -5033,16 +5032,18 @@ static void build_zonelists(pg_data_t *pgdat)
 	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
 		if (!node_online(node))
 			continue;
-		j = build_zonelists_node(NODE_DATA(node), zonelist, j);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		zonerefs += nr_zones;
 	}
 	for (node = 0; node < local_node; node++) {
 		if (!node_online(node))
 			continue;
-		j = build_zonelists_node(NODE_DATA(node), zonelist, j);
+		nr_zones = build_zonerefs_node(NODE_DATA(node), zonerefs);
+		zonerefs += nr_zones;
 	}
 
-	zonelist->_zonerefs[j].zone = NULL;
-	zonelist->_zonerefs[j].zone_idx = 0;
+	zonerefs->zone = NULL;
+	zonerefs->zone_idx = 0;
 }
 
 #endif	/* CONFIG_NUMA */
-- 
2.11.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
                   ` (5 preceding siblings ...)
  2017-07-21 14:39 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-21 14:39 ` [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex Michal Hocko
  2017-07-21 14:39 ` [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Michal Hocko
  8 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

build_all_zonelists has been (ab)using stop_machine to make sure that
zonelists do not change while somebody is looking at them. This is
just a gross hack because a) it complicates the context from which
we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug:
switch locking to a percpu rwsem")), b) it is not really necessary
especially after "mm, page_alloc: simplify zonelist initialization",
and c) it doesn't really provide the protection it claims (see below).

Updates of the zonelists happen very seldom, basically only when a zone
becomes populated during memory online or when it loses all the memory
during offline. A racing iteration over zonelists could either miss a
zone or try to work on one zone twice. Both of these are something we
can live with occasionally because there will always be at least one
zone visible, so for example we are not likely to fail an allocation
too easily.

Please note that the original stop_machine approach doesn't really
provide better exclusion because the iteration might be interrupted
halfway through (unless the whole iteration runs with preemption
disabled, which is mostly not the case), so some zones could still be
seen twice or a zone missed.

I have run a pathological online/offline loop on a single memblock in the
movable zone while stressing the same small node with some memory
pressure.
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 943, 943, 943)
Node 1, zone    DMA32
  pages free     227310
        min      8294
        low      10367
        high     12440
        spanned  262112
        present  262112
        managed  241436
        protection: (0, 0, 0, 0)
Node 1, zone   Normal
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 0, 1024)
Node 1, zone  Movable
  pages free     32722
        min      85
        low      117
        high     149
        spanned  32768
        present  32768
        managed  32768
        protection: (0, 0, 0, 0)

root@test1:/sys/devices/system/node/node1# while true
do
	echo offline > memory34/state
	echo online_movable > memory34/state
done

root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4

and it survived without any unexpected behavior. While this is not
really great test coverage, it should exercise the allocation path
quite a lot.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d78dc5a708f..cf2eb3cf2cc5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5073,8 +5073,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
  */
 DEFINE_MUTEX(zonelists_mutex);
 
-/* return values int ....just for stop_machine() */
-static int __build_all_zonelists(void *data)
+static void __build_all_zonelists(void *data)
 {
 	int nid;
 	int __maybe_unused cpu;
@@ -5110,8 +5109,6 @@ static int __build_all_zonelists(void *data)
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
 #endif
 	}
-
-	return 0;
 }
 
 static noinline void __init
@@ -5153,9 +5150,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
-		/* we have to stop all cpus to guarantee there is no user
-		   of zonelist */
-		stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL);
+		__build_all_zonelists(pgdat);
 		/* cpuset refresh routine should be here */
 	}
 	vm_total_pages = nr_free_pagecache_pages();
-- 
2.11.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
                   ` (6 preceding siblings ...)
  2017-07-21 14:39 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  2017-07-21 14:39 ` [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Michal Hocko
  8 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko

From: Michal Hocko <mhocko@suse.com>

zonelists_mutex was introduced by 4eaf3f64397c ("mem-hotplug: fix
potential race while building zonelist for new populated zone") to
protect zonelist building from races. This is no longer needed because
both memory online and offline are fully serialized. New users of the
mutex have grown since then, though.

Notably, setup_per_zone_wmarks wants to prevent races between memory
hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
(see cfd3da1e49bb ("mm: Serialize access to min_free_kbytes")). Let's
add a private lock for that purpose. This will not prevent us from
seeing a halfway-through memory hotplug operation, but that shouldn't
be a big deal because memory hotplug will update watermarks explicitly,
so we will eventually get a full picture. The lock just makes sure we
won't race when updating watermarks, which could lead to weird results.

Also, __build_all_zonelists manipulates global data, so add a private
lock for it as well. This doesn't seem to be strictly necessary today,
but it is more robust to have a lock there.

While we are at it make sure we document that memory online/offline
depends on a full serialization either via mem_hotplug_begin() or
device_lock.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  1 -
 mm/memory_hotplug.c    | 12 ++----------
 mm/page_alloc.c        | 18 +++++++++---------
 3 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0add5ee09729..fda9afbd14d9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -770,7 +770,6 @@ static inline bool is_dev_zone(const struct zone *zone)
 
 #include <linux/memory_hotplug.h>
 
-extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d2f6a11075c..c1b0077eb9d1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -917,7 +917,7 @@ static struct zone * __meminit move_pfn_range(int online_type, int nid,
 	return zone;
 }
 
-/* Must be protected by mem_hotplug_begin() */
+/* Must be protected by mem_hotplug_begin() or a device_lock */
 int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
 {
 	unsigned long flags;
@@ -949,7 +949,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	 * This means the page allocator ignores this zone.
 	 * So, zonelist must be updated after online.
 	 */
-	mutex_lock(&zonelists_mutex);
 	if (!populated_zone(zone)) {
 		need_zonelists_rebuild = 1;
 		setup_zone_pageset(zone);
@@ -960,7 +959,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 	if (ret) {
 		if (need_zonelists_rebuild)
 			zone_pcp_reset(zone);
-		mutex_unlock(&zonelists_mutex);
 		goto failed_addition;
 	}
 
@@ -978,8 +976,6 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 			zone_pcp_update(zone);
 	}
 
-	mutex_unlock(&zonelists_mutex);
-
 	init_per_zone_wmark_min();
 
 	if (onlined_pages) {
@@ -1050,9 +1046,7 @@ static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
 	 * The node we allocated has no zone fallback lists. For avoiding
 	 * to access not-initialized zonelist, build here.
 	 */
-	mutex_lock(&zonelists_mutex);
 	build_all_zonelists(pgdat);
-	mutex_unlock(&zonelists_mutex);
 
 	/*
 	 * zone->managed_pages is set to an approximate value in
@@ -1719,9 +1713,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 
 	if (!populated_zone(zone)) {
 		zone_pcp_reset(zone);
-		mutex_lock(&zonelists_mutex);
 		build_all_zonelists(NULL);
-		mutex_unlock(&zonelists_mutex);
 	} else
 		zone_pcp_update(zone);
 
@@ -1747,7 +1739,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
 	return ret;
 }
 
-/* Must be protected by mem_hotplug_begin() */
+/* Must be protected by mem_hotplug_begin() or a device_lock */
 int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
 {
 	return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cf2eb3cf2cc5..369a263ce2fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5067,17 +5067,14 @@ static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch);
 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);
 static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats);
 
-/*
- * Global mutex to protect against size modification of zonelists
- * as well as to serialize pageset setup for the new populated zone.
- */
-DEFINE_MUTEX(zonelists_mutex);
-
 static void __build_all_zonelists(void *data)
 {
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
+	static DEFINE_SPINLOCK(lock);
+
+	spin_lock(&lock);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -5109,6 +5106,8 @@ static void __build_all_zonelists(void *data)
 			set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu)));
 #endif
 	}
+
+	spin_unlock(&lock);
 }
 
 static noinline void __init
@@ -5139,7 +5138,6 @@ build_all_zonelists_init(void)
 }
 
 /*
- * Called with zonelists_mutex held always
  * unless system_state == SYSTEM_BOOTING.
  *
  * __ref due to call of __init annotated helper build_all_zonelists_init
@@ -6880,9 +6878,11 @@ static void __setup_per_zone_wmarks(void)
  */
 void setup_per_zone_wmarks(void)
 {
-	mutex_lock(&zonelists_mutex);
+	static DEFINE_SPINLOCK(lock);
+
+	spin_lock(&lock);
 	__setup_per_zone_wmarks();
-	mutex_unlock(&zonelists_mutex);
+	spin_unlock(&lock);
 }
 
 /*
-- 
2.11.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations
  2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
                   ` (7 preceding siblings ...)
  2017-07-21 14:39 ` [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex Michal Hocko
@ 2017-07-21 14:39 ` Michal Hocko
  8 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML,
	Michal Hocko, Joonsoo Kim, Shaohua Li

From: Michal Hocko <mhocko@suse.com>

f52407ce2dea ("memory hotplug: alloc page from other node in memory
online") introduced N_HIGH_MEMORY checks to only use NUMA-aware
allocations when there is some memory present, because the respective
node might not have any memory yet at the time and so the allocation
could fail or even OOM. Things have changed since then, though.
Zonelists are now always initialized before we do any allocations, even
for hotplug (see 959ecc48fc75 ("mm/memory_hotplug.c: fix building of
node hotplug zonelist")). Therefore these checks are not really needed.
In fact, the caller of the allocator should never care whether the node
is populated, because that might change at any time.

Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_ext.c       |  5 +----
 mm/sparse-vmemmap.c | 11 +++--------
 mm/sparse.c         | 10 +++-------
 3 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 88ccc044b09a..714ce79256c5 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -222,10 +222,7 @@ static void *__meminit alloc_page_ext(size_t size, int nid)
 		return addr;
 	}
 
-	if (node_state(nid, N_HIGH_MEMORY))
-		addr = vzalloc_node(size, nid);
-	else
-		addr = vzalloc(size);
+	addr = vzalloc_node(size, nid);
 
 	return addr;
 }
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index c50b1a14d55e..d1a39b8051e0 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -54,14 +54,9 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		if (node_state(node, N_HIGH_MEMORY))
-			page = alloc_pages_node(
-				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-				get_order(size));
-		else
-			page = alloc_pages(
-				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-				get_order(size));
+		page = alloc_pages_node(node,
+			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
+			get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
diff --git a/mm/sparse.c b/mm/sparse.c
index 7b4be3fd5cac..a9783acf2bb9 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -65,14 +65,10 @@ static noinline struct mem_section __ref *sparse_index_alloc(int nid)
 	unsigned long array_size = SECTIONS_PER_ROOT *
 				   sizeof(struct mem_section);
 
-	if (slab_is_available()) {
-		if (node_state(nid, N_HIGH_MEMORY))
-			section = kzalloc_node(array_size, GFP_KERNEL, nid);
-		else
-			section = kzalloc(array_size, GFP_KERNEL);
-	} else {
+	if (slab_is_available())
+		section = kzalloc_node(array_size, GFP_KERNEL, nid);
+	else
 		section = memblock_virt_alloc_node(array_size, nid);
-	}
 
 	return section;
 }
-- 
2.11.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-21 14:39 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
@ 2017-07-21 14:45   ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-21 14:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML, linux-api

Oops, linux-api didn't make it into the CC list. Let's add it here.
The cover letter, which will give you some context, is here:
http://lkml.kernel.org/r/20170721143915.14161-1-mhocko@kernel.org

On Fri 21-07-17 16:39:07, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Supporting zone-ordered zonelists costs us a lot of code while
> the usefulness is arguable, if existent at all. Mel has already made
> node ordering the default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fall back to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests already have to skip over all highmem pages on
> LRUs, so zone ordering doesn't save much reclaim time. The only
> advantage of zone ordering is under light memory pressure, when
> highmem requests never hit lowmem zones and the lowmem pressure
> doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use a 64b kernel on such HW, I believe we
> should rather care about code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> somebody tries to set zone ordering either from the kernel command
> line or the sysctl.
> 
> Changes since v1
> - do not print Default in the sysctl handler because our behavior was
>   rather inconsistent in the past: numa_zonelist_order was lowercase
>   while zonelist_order_name was uppercase, so an unchanged boot-time
>   value would print lowercase while an updated value could be
>   uppercase. Print "Node", which is the default, instead - Mel
> - update documentation - Vlastimil
> 
> Cc: linux-api@vger.kernel.org
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   2 +-
>  Documentation/sysctl/vm.txt                     |   4 +-
>  Documentation/vm/numa                           |   7 +-
>  include/linux/mmzone.h                          |   2 -
>  kernel/sysctl.c                                 |   2 -
>  mm/page_alloc.c                                 | 178 +++---------------------
>  6 files changed, 29 insertions(+), 166 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 60530c8490ff..28f1a0f84456 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2724,7 +2724,7 @@
>  			Allowed values are enable and disable
>  
>  	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> -			one of ['zone', 'node', 'default'] can be specified
> +			'node', 'default' can be specified
>  			This can be set from sysctl after boot.
>  			See Documentation/sysctl/vm.txt for details.
>  
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 48244c42ff52..9baf66a9ef4e 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
>  
>  numa_zonelist_order
>  
> -This sysctl is only for NUMA.
> +This sysctl is only for NUMA and it is deprecated. Anything but
> +Node order will fail!
> +
>  'where the memory is allocated from' is controlled by zonelists.
>  (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
>   you may be able to read ZONE_DMA as ZONE_DMA32...)
> diff --git a/Documentation/vm/numa b/Documentation/vm/numa
> index a08f71647714..a31b85b9bb88 100644
> --- a/Documentation/vm/numa
> +++ b/Documentation/vm/numa
> @@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
>  fall back to the same zone type on a different node, or to a different zone
>  type on the same node.  This is an important consideration because some zones,
>  such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
> -a default zonelist order based on the sizes of the various zone types relative
> -to the total memory of the node and the total memory of the system.  The
> -default zonelist order may be overridden using the numa_zonelist_order kernel
> -boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
> -Documentation/sysctl/vm.txt]
> +a default Node ordered zonelist. This means it tries to fallback to other zones
> +from the same node before using remote nodes which are ordered by NUMA distance.
>  
>  By default, Linux will attempt to satisfy memory allocation requests from the
>  node to which the CPU that executes the request is assigned.  Specifically,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fc14b8b3f6ce..b849006b20d3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -895,8 +895,6 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
>  
>  extern int numa_zonelist_order_handler(struct ctl_table *, int,
>  			void __user *, size_t *, loff_t *);
> -extern char numa_zonelist_order[];
> -#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
>  
>  #ifndef CONFIG_NEED_MULTIPLE_NODES
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 655686d546cb..0cbce40f5426 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1553,8 +1553,6 @@ static struct ctl_table vm_table[] = {
>  #ifdef CONFIG_NUMA
>  	{
>  		.procname	= "numa_zonelist_order",
> -		.data		= &numa_zonelist_order,
> -		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
>  		.mode		= 0644,
>  		.proc_handler	= numa_zonelist_order_handler,
>  	},
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb4c360..b839a90e24c8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4791,52 +4791,18 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
>  	return nr_zones;
>  }
>  
> -
> -/*
> - *  zonelist_order:
> - *  0 = automatic detection of better ordering.
> - *  1 = order by ([node] distance, -zonetype)
> - *  2 = order by (-zonetype, [node] distance)
> - *
> - *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
> - *  the same zonelist. So only NUMA can configure this param.
> - */
> -#define ZONELIST_ORDER_DEFAULT  0
> -#define ZONELIST_ORDER_NODE     1
> -#define ZONELIST_ORDER_ZONE     2
> -
> -/* zonelist order in the kernel.
> - * set_zonelist_order() will set this to NODE or ZONE.
> - */
> -static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
> -
> -
>  #ifdef CONFIG_NUMA
> -/* The value user specified ....changed by config */
> -static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -/* string for sysctl */
> -#define NUMA_ZONELIST_ORDER_LEN	16
> -char numa_zonelist_order[16] = "default";
> -
> -/*
> - * interface for configure zonelist ordering.
> - * command line option "numa_zonelist_order"
> - *	= "[dD]efault	- default, automatic configuration.
> - *	= "[nN]ode 	- order by node locality, then by zone within node
> - *	= "[zZ]one      - order by zone, then by locality within zone
> - */
>  
>  static int __parse_numa_zonelist_order(char *s)
>  {
> -	if (*s == 'd' || *s == 'D') {
> -		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -	} else if (*s == 'n' || *s == 'N') {
> -		user_zonelist_order = ZONELIST_ORDER_NODE;
> -	} else if (*s == 'z' || *s == 'Z') {
> -		user_zonelist_order = ZONELIST_ORDER_ZONE;
> -	} else {
> -		pr_warn("Ignoring invalid numa_zonelist_order value:  %s\n", s);
> +	/*
> +	 * We used to support different zonelist modes but they turned
> +	 * out to be just not useful. Let's keep the warning in place
> +	 * if somebody still uses the cmd line parameter so that we do
> +	 * not fail it silently.
> +	 */
> +	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
> +		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
>  		return -EINVAL;
>  	}
>  	return 0;
> @@ -4844,16 +4810,10 @@ static int __parse_numa_zonelist_order(char *s)
>  
>  static __init int setup_numa_zonelist_order(char *s)
>  {
> -	int ret;
> -
>  	if (!s)
>  		return 0;
>  
> -	ret = __parse_numa_zonelist_order(s);
> -	if (ret == 0)
> -		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
> -
> -	return ret;
> +	return __parse_numa_zonelist_order(s);
>  }
>  early_param("numa_zonelist_order", setup_numa_zonelist_order);
>  
> @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *length,
>  		loff_t *ppos)
>  {
> -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	char *str;
>  	int ret;
> -	static DEFINE_MUTEX(zl_order_mutex);
>  
> -	mutex_lock(&zl_order_mutex);
> -	if (write) {
> -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> -			ret = -EINVAL;
> -			goto out;
> -		}
> -		strcpy(saved_string, (char *)table->data);
> +	if (!write) {
> +		int len = sizeof("Node");
> +		if (copy_to_user(buffer, "Node", len))
> +			return -EFAULT;
> +		return len;
>  	}
> -	ret = proc_dostring(table, write, buffer, length, ppos);
> -	if (ret)
> -		goto out;
> -	if (write) {
> -		int oldval = user_zonelist_order;
>  
> -		ret = __parse_numa_zonelist_order((char *)table->data);
> -		if (ret) {
> -			/*
> -			 * bogus value.  restore saved string
> -			 */
> -			strncpy((char *)table->data, saved_string,
> -				NUMA_ZONELIST_ORDER_LEN);
> -			user_zonelist_order = oldval;
> -		} else if (oldval != user_zonelist_order) {
> -			mutex_lock(&zonelists_mutex);
> -			build_all_zonelists(NULL, NULL);
> -			mutex_unlock(&zonelists_mutex);
> -		}
> -	}
> -out:
> -	mutex_unlock(&zl_order_mutex);
> +	str = memdup_user_nul(buffer, 16);
> +	if (IS_ERR(str))
> +		return PTR_ERR(str);
> +
> +	ret = __parse_numa_zonelist_order(str);
> +	kfree(str);
>  	return ret;
>  }
>  
> @@ -5006,70 +4948,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
>   */
>  static int node_order[MAX_NUMNODES];
>  
> -static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
> -{
> -	int pos, j, node;
> -	int zone_type;		/* needs to be signed */
> -	struct zone *z;
> -	struct zonelist *zonelist;
> -
> -	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
> -	pos = 0;
> -	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
> -		for (j = 0; j < nr_nodes; j++) {
> -			node = node_order[j];
> -			z = &NODE_DATA(node)->node_zones[zone_type];
> -			if (managed_zone(z)) {
> -				zoneref_set_zone(z,
> -					&zonelist->_zonerefs[pos++]);
> -				check_highest_zone(zone_type);
> -			}
> -		}
> -	}
> -	zonelist->_zonerefs[pos].zone = NULL;
> -	zonelist->_zonerefs[pos].zone_idx = 0;
> -}
> -
> -#if defined(CONFIG_64BIT)
> -/*
> - * Devices that require DMA32/DMA are relatively rare and do not justify a
> - * penalty to every machine in case the specialised case applies. Default
> - * to Node-ordering on 64-bit NUMA machines
> - */
> -static int default_zonelist_order(void)
> -{
> -	return ZONELIST_ORDER_NODE;
> -}
> -#else
> -/*
> - * On 32-bit, the Normal zone needs to be preserved for allocations accessible
> - * by the kernel. If processes running on node 0 deplete the low memory zone
> - * then reclaim will occur more frequency increasing stalls and potentially
> - * be easier to OOM if a large percentage of the zone is under writeback or
> - * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
> - * Hence, default to zone ordering on 32-bit.
> - */
> -static int default_zonelist_order(void)
> -{
> -	return ZONELIST_ORDER_ZONE;
> -}
> -#endif /* CONFIG_64BIT */
> -
> -static void set_zonelist_order(void)
> -{
> -	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
> -		current_zonelist_order = default_zonelist_order();
> -	else
> -		current_zonelist_order = user_zonelist_order;
> -}
> -
>  static void build_zonelists(pg_data_t *pgdat)
>  {
>  	int i, node, load;
>  	nodemask_t used_mask;
>  	int local_node, prev_node;
>  	struct zonelist *zonelist;
> -	unsigned int order = current_zonelist_order;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_ZONELISTS; i++) {
> @@ -5099,15 +4983,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  
>  		prev_node = node;
>  		load--;
> -		if (order == ZONELIST_ORDER_NODE)
> -			build_zonelists_in_node_order(pgdat, node);
> -		else
> -			node_order[i++] = node;	/* remember order */
> -	}
> -
> -	if (order == ZONELIST_ORDER_ZONE) {
> -		/* calculate node order -- i.e., DMA last! */
> -		build_zonelists_in_zone_order(pgdat, i);
> +		build_zonelists_in_node_order(pgdat, node);
>  	}
>  
>  	build_thisnode_zonelists(pgdat);
> @@ -5135,11 +5011,6 @@ static void setup_min_unmapped_ratio(void);
>  static void setup_min_slab_ratio(void);
>  #else	/* CONFIG_NUMA */
>  
> -static void set_zonelist_order(void)
> -{
> -	current_zonelist_order = ZONELIST_ORDER_ZONE;
> -}
> -
>  static void build_zonelists(pg_data_t *pgdat)
>  {
>  	int node, local_node;
> @@ -5279,8 +5150,6 @@ build_all_zonelists_init(void)
>   */
>  void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
>  {
> -	set_zonelist_order();
> -
>  	if (system_state == SYSTEM_BOOTING) {
>  		build_all_zonelists_init();
>  	} else {
> @@ -5306,9 +5175,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
>  	else
>  		page_group_by_mobility_disabled = 0;
>  
> -	pr_info("Built %i zonelists in %s order, mobility grouping %s.  Total pages: %ld\n",
> +	pr_info("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
>  		nr_online_nodes,
> -		zonelist_order_name[current_zonelist_order],
>  		page_group_by_mobility_disabled ? "off" : "on",
>  		vm_total_pages);
>  #ifdef CONFIG_NUMA
> -- 
> 2.11.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread
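
The rewritten handler above reduces parsing to a first-character check. A
minimal userspace sketch of that accept/reject behavior (not the kernel
function itself; the function name here is reused for illustration and
plain errno constants stand in for the kernel's):

```c
#include <assert.h>
#include <errno.h>

/*
 * Userspace sketch of the reduced __parse_numa_zonelist_order():
 * "default"/"Default" and "node"/"Node" spellings are still accepted
 * for backward compatibility, anything else (notably "zone") is
 * rejected with -EINVAL.
 */
static int parse_numa_zonelist_order(const char *s)
{
	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N'))
		return -EINVAL;	/* zone ordering is no longer supported */
	return 0;
}
```

The sysctl write path in the patch feeds the user-supplied string to
exactly this kind of check after memdup_user_nul() has copied it into
kernel memory.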

* Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
  2017-07-21 14:39 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
@ 2017-07-24  9:25   ` Vlastimil Babka
  0 siblings, 0 replies; 22+ messages in thread
From: Vlastimil Babka @ 2017-07-24  9:25 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, linux-mm, LKML, Michal Hocko

On 07/21/2017 04:39 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelist
> users: they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
> 
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> Zonelist consumers might still see a halfway initialized state but that
> should be much more tolerable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
> 
> While at it let's simplify build_zonelists_node which is rather
> confusing now. It gets an index into the zoneref array and returns
> the updated index for the next iteration. Let's rename the function
> to build_zonerefs_node to better reflect its purpose and give it
> zoneref array to update. The function doesn't return the index anymore. It
> just returns the number of added zones so that the caller can advance
> the zoneref array start for the next update.
> 
> This patch alone doesn't introduce any functional change yet; it
> is merely preparatory work for later changes.
> 
> Changes since v1
> - build_zonelists_node -> build_zonerefs_node and operate directly on
>   zonerefs array rather than play tricks with index into the array.
> - give build_zonelists_in_node_order nr_nodes to not iterate over all
>   MAX_NUMNODES as per Mel
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 22+ messages in thread
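
The one-pass construction the patch above prepares for can be sketched
with toy types (plain ints stand in for the kernel's struct zoneref;
the sk_-prefixed names and the fixed zone count are this sketch's own):

```c
#include <assert.h>

#define SK_MAX_ZONES 3	/* zones per node in this toy model */

/*
 * Toy model of the reworked helper: write all populated zones of @node,
 * highest zone first, into @zonerefs and return how many were added so
 * the caller can advance its cursor (mirrors build_zonerefs_node).
 */
static int sk_build_zonerefs_node(int node, int populated[][SK_MAX_ZONES],
				  int *zonerefs)
{
	int z, nr = 0;

	for (z = SK_MAX_ZONES - 1; z >= 0; z--)
		if (populated[node][z])
			zonerefs[nr++] = node * SK_MAX_ZONES + z;
	return nr;
}

/* One-pass initialization from a precomputed node order. */
static int sk_build_zonelist(const int *node_order, int nr_nodes,
			     int populated[][SK_MAX_ZONES], int *zonerefs)
{
	int i, pos = 0;

	for (i = 0; i < nr_nodes; i++)
		pos += sk_build_zonerefs_node(node_order[i], populated,
					      &zonerefs[pos]);
	return pos;	/* the caller would terminate the list at pos */
}
```

Because the whole array is filled from a precomputed node order in one
pass, a reader walking it mid-update would at worst see a zone twice or
skip one, never an empty list, which is the property the changelog
relies on.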

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-19  9:33   ` Vlastimil Babka
@ 2017-07-19 13:44     ` Michal Hocko
  0 siblings, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-19 13:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Johannes Weiner, LKML, linux-api

On Wed 19-07-17 11:33:49, Vlastimil Babka wrote:
> On 07/14/2017 09:59 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Supporting zone ordered zonelists costs us a lot of code while
> > the usefulness is arguable, if existent at all. Mel has already made
> > node ordering the default on 64b systems. 32b systems are still using
> > ZONELIST_ORDER_ZONE because it is considered better to fall back to
> > a different NUMA node rather than consume precious lowmem zones.
> > 
> > This argument is, however, weakened by the fact that the memory reclaim
> > has been reworked to be node rather than zone oriented. This means
> > that lowmem requests have to skip over all highmem pages on LRUs already
> > and so zone ordering doesn't save the reclaim time much. So the only
> > advantage of the zone ordering is under a light memory pressure when
> > highmem requests do not ever hit into lowmem zones and the lowmem
> > pressure doesn't need to reclaim.
> > 
> > Considering that 32b NUMA systems are rather suboptimal already and
> > it is generally advisable to use a 64b kernel on such HW, I believe we
> > should rather care about the code maintainability and just get rid of
> > ZONELIST_ORDER_ZONE altogether. Keep sysctl in place and warn if
> > somebody tries to set zone ordering either from kernel command line
> > or the sysctl.
> > 
> > Cc: <linux-api@vger.kernel.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Found some leftovers to cleanup:
> 
> include/linux/mmzone.h:
> extern char numa_zonelist_order[];
> #define NUMA_ZONELIST_ORDER_LEN 16      /* string buffer size */
> 
> Also update docs?
> Documentation/sysctl/vm.txt:zone.  Specify "[Zz]one" for zone order.
> Documentation/admin-guide/kernel-parameters.txt:
> numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> Documentation/vm/numa:a default zonelist order based on the sizes of the
> various zone types relative
> Documentation/vm/numa:default zonelist order may be overridden using the
> numa_zonelist_order kernel
> 
> Otherwise,
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

This?
---
commit 06c69d6785447160717ca8cc476dc9cac7a3f964
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Jul 19 14:28:31 2017 +0200

    fold me
    
    - update documentation as per Vlastimil
    
    Acked-by: Vlastimil Babka <vbabka@suse.cz>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 60530c8490ff..28f1a0f84456 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2724,7 +2724,7 @@
 			Allowed values are enable and disable
 
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
-			one of ['zone', 'node', 'default'] can be specified
+			'node' or 'default' can be specified
 			This can be set from sysctl after boot.
 			See Documentation/sysctl/vm.txt for details.
 
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 48244c42ff52..9baf66a9ef4e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
 
 numa_zonelist_order
 
-This sysctl is only for NUMA.
+This sysctl is only for NUMA and it is deprecated. Anything but
+Node order will fail!
+
 'where the memory is allocated from' is controlled by zonelists.
 (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
  you may be able to read ZONE_DMA as ZONE_DMA32...)
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index a08f71647714..a31b85b9bb88 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
 fall back to the same zone type on a different node, or to a different zone
 type on the same node.  This is an important consideration because some zones,
 such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
-a default zonelist order based on the sizes of the various zone types relative
-to the total memory of the node and the total memory of the system.  The
-default zonelist order may be overridden using the numa_zonelist_order kernel
-boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
-Documentation/sysctl/vm.txt]
+a default Node ordered zonelist. This means it tries to fall back to other zones
+from the same node before using remote nodes which are ordered by NUMA distance.
 
 By default, Linux will attempt to satisfy memory allocation requests from the
 node to which the CPU that executes the request is assigned.  Specifically,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b3f6ce..b849006b20d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -895,8 +895,6 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
-extern char numa_zonelist_order[];
-#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread
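
The "ordered by NUMA distance" fallback the updated Documentation/vm/numa
text describes can be illustrated with a small greedy selection (a
hypothetical fixed-size sketch, not the kernel's find_next_best_node()):

```c
#include <assert.h>

#define SK_NR_NODES 4

/*
 * Sketch: order all nodes by their NUMA distance from @local_node,
 * nearest first (ties broken by lower node id).  The kernel derives
 * such an order when building node-ordered fallback zonelists.
 */
static void sk_order_by_distance(int local_node,
				 int dist[SK_NR_NODES][SK_NR_NODES],
				 int *order)
{
	int used[SK_NR_NODES] = { 0 };
	int i, j;

	for (i = 0; i < SK_NR_NODES; i++) {
		int best = -1;

		for (j = 0; j < SK_NR_NODES; j++)
			if (!used[j] && (best < 0 ||
			    dist[local_node][j] < dist[local_node][best]))
				best = j;
		used[best] = 1;
		order[i] = best;
	}
}
```

With the local node always winning (distance to self is smallest), the
resulting order starts with the local node and then walks outward, which
is exactly the "other zones from the same node before remote nodes"
behavior the documentation now states.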

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
  2017-07-14  9:36   ` Mel Gorman
@ 2017-07-19  9:33   ` Vlastimil Babka
  2017-07-19 13:44     ` Michal Hocko
  1 sibling, 1 reply; 22+ messages in thread
From: Vlastimil Babka @ 2017-07-19  9:33 UTC (permalink / raw)
  To: Michal Hocko, linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko,
	linux-api

On 07/14/2017 09:59 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> > Supporting zone ordered zonelists costs us a lot of code while
> > the usefulness is arguable, if existent at all. Mel has already made
> > node ordering the default on 64b systems. 32b systems are still using
> > ZONELIST_ORDER_ZONE because it is considered better to fall back to
> a different NUMA node rather than consume precious lowmem zones.
> 
> > This argument is, however, weakened by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save the reclaim time much. So the only
> advantage of the zone ordering is under a light memory pressure when
> highmem requests do not ever hit into lowmem zones and the lowmem
> pressure doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> > it is generally advisable to use a 64b kernel on such HW, I believe we
> > should rather care about the code maintainability and just get rid of
> > ZONELIST_ORDER_ZONE altogether. Keep sysctl in place and warn if
> somebody tries to set zone ordering either from kernel command line
> or the sysctl.
> 
> Cc: <linux-api@vger.kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Found some leftovers to cleanup:

include/linux/mmzone.h:
extern char numa_zonelist_order[];
#define NUMA_ZONELIST_ORDER_LEN 16      /* string buffer size */

Also update docs?
Documentation/sysctl/vm.txt:zone.  Specify "[Zz]one" for zone order.
Documentation/admin-guide/kernel-parameters.txt:
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
Documentation/vm/numa:a default zonelist order based on the sizes of the
various zone types relative
Documentation/vm/numa:default zonelist order may be overridden using the
numa_zonelist_order kernel

Otherwise,
Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 12:56           ` Mel Gorman
  2017-07-14 13:01             ` Mel Gorman
@ 2017-07-14 13:08             ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-14 13:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 13:56:16, Mel Gorman wrote:
> Also strongly suggest you continue using proc_dostring because it
> catches all the corner-cases that can occur.

proc_dostring would need a data field in the sysctl table and I found that more
confusing than having a simplistic read/write. I can do that if you
insist though.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 12:56           ` Mel Gorman
@ 2017-07-14 13:01             ` Mel Gorman
  2017-07-14 13:08             ` Michal Hocko
  1 sibling, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2017-07-14 13:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri, Jul 14, 2017 at 01:56:16PM +0100, Mel Gorman wrote:
> >  	if (!write) {
> > -		int len = sizeof("Default");
> > -		if (copy_to_user(buffer, "Default", len))
> > +		int len = sizeof("Node");
> > +		if (copy_to_user(buffer, "Node", len))
> >  			return -EFAULT;
> 
> Ok for the name. But what's with using sizeof?

Bah, sizeof static compile-time string versus char *. Never mind.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 11:38         ` Michal Hocko
@ 2017-07-14 12:56           ` Mel Gorman
  2017-07-14 13:01             ` Mel Gorman
  2017-07-14 13:08             ` Michal Hocko
  0 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2017-07-14 12:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri, Jul 14, 2017 at 01:38:40PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 12:16:33, Mel Gorman wrote:
> > On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > > > That should be "default" because the original code would have the proc
> > > > entry display "default" unless it was set at runtime. Pretty weird I
> > > > know but it's always possible someone is parsing the original default
> > > > and not handling it properly.
> > > 
> > > Ohh, right! That is indeed strange. Then I guess it would probably be
> > > better to simply return Node to make it clear what the default is. What
> > > do you think?
> > > 
> > 
> > That would work too. The casing still matches.
> 
> This folded in?
> ---
> From c7c36f011590680b254813be00ed791ddbc1bf1c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 14 Jul 2017 13:36:05 +0200
> Subject: [PATCH] fold me "mm, page_alloc: rip out ZONELIST_ORDER_ZONE"
> 
> - do not print Default in the sysctl handler because our behavior was
> rather inconsistent in the past: numa_zonelist_order was lowercase while
> zonelist_order_name was uppercase, so a boot-time unchanged value would
> print lowercase while an updated value could be uppercase. Print "Node",
> which is the default, instead - Mel
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd4c96edcec3..49bade7ff049 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4828,8 +4828,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  	int ret;
>  
>  	if (!write) {
> -		int len = sizeof("Default");
> -		if (copy_to_user(buffer, "Default", len))
> +		int len = sizeof("Node");
> +		if (copy_to_user(buffer, "Node", len))
>  			return -EFAULT;

Ok for the name. But what's with using sizeof? The type is char * so it
just happens to work for Default, but not for Node. Also strongly suggest
you continue using proc_dostring because it catches all the corner-cases
that can occur.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread
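
The sizeof question Mel raises above (and retracts in his follow-up)
comes down to string literals being arrays: sizeof applied to the
literal is the array size including the terminating NUL, which is why
it is a valid byte count for the copy, whereas sizeof through a char *
variable would only yield the pointer size. A tiny userspace
illustration (the helper name is this sketch's own):

```c
#include <assert.h>
#include <string.h>

/*
 * sizeof("Node") is the size of the char[5] array 'N','o','d','e','\0',
 * so it is a correct length for copying the NUL-terminated reply; the
 * same expression through a char * would instead yield sizeof(char *).
 */
static size_t literal_reply_len(void)
{
	return sizeof("Node");
}
```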

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 11:16       ` Mel Gorman
@ 2017-07-14 11:38         ` Michal Hocko
  2017-07-14 12:56           ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2017-07-14 11:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 12:16:33, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > > That should be "default" because the original code would have the proc
> > > entry display "default" unless it was set at runtime. Pretty weird I
> > > know but it's always possible someone is parsing the original default
> > > and not handling it properly.
> > 
> > Ohh, right! That is indeed strange. Then I guess it would probably be
> > better to simply return Node to make it clear what the default is. What
> > do you think?
> > 
> 
> That would work too. The casing still matches.

This folded in?
---

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 10:47     ` Michal Hocko
@ 2017-07-14 11:16       ` Mel Gorman
  2017-07-14 11:38         ` Michal Hocko
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2017-07-14 11:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > That should be "default" because the original code would have the proc
> > entry display "default" unless it was set at runtime. Pretty weird I
> > know but it's always possible someone is parsing the original default
> > and not handling it properly.
> 
> Ohh, right! That is indeed strange. Then I guess it would probably be
> better to simply return Node to make it clear what the default is. What
> do you think?
> 

That would work too. The casing still matches.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  9:36   ` Mel Gorman
@ 2017-07-14 10:47     ` Michal Hocko
  2017-07-14 11:16       ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Michal Hocko @ 2017-07-14 10:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 10:36:50, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 09:59:58AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Supporting zone ordered zonelists costs us a lot of code while
> > the usefulness is arguable, if existent at all. Mel has already made
> > node ordering the default on 64b systems. 32b systems are still using
> > ZONELIST_ORDER_ZONE because it is considered better to fall back to
> > a different NUMA node rather than consume precious lowmem zones.
> > 
> > This argument is, however, weakened by the fact that the memory reclaim
> > has been reworked to be node rather than zone oriented. This means
> > that lowmem requests have to skip over all highmem pages on LRUs already
> > and so zone ordering doesn't save the reclaim time much. So the only
> > advantage of the zone ordering is under a light memory pressure when
> > highmem requests do not ever hit into lowmem zones and the lowmem
> > pressure doesn't need to reclaim.
> > 
> > Considering that 32b NUMA systems are rather suboptimal already and
> > it is generally advisable to use a 64b kernel on such HW, I believe we
> > should rather care about the code maintainability and just get rid of
> > ZONELIST_ORDER_ZONE altogether. Keep sysctl in place and warn if
> > somebody tries to set zone ordering either from kernel command line
> > or the sysctl.
> > 
> > Cc: <linux-api@vger.kernel.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > index 80e4adb4c360..d9f4ea057e74 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
> >  		void __user *buffer, size_t *length,
> >  		loff_t *ppos)
> >  {
> > -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> > +	char *str;
> >  	int ret;
> > -	static DEFINE_MUTEX(zl_order_mutex);
> >  
> > -	mutex_lock(&zl_order_mutex);
> > -	if (write) {
> > -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> > -			ret = -EINVAL;
> > -			goto out;
> > -		}
> > -		strcpy(saved_string, (char *)table->data);
> > +	if (!write) {
> > +		int len = sizeof("Default");
> > +		if (copy_to_user(buffer, "Default", len))
> > +			return -EFAULT;
> > +		return len;
> >  	}
> 
> > That should be "default" because the original code would have the proc
> entry display "default" unless it was set at runtime. Pretty weird I
> know but it's always possible someone is parsing the original default
> and not handling it properly.

Ohh, right! That is indeed strange. Then I guess it would probably be
better to simply return Node to make it clear what the default is. What
do you think?

> Otherwise I think we're way past the point where large memory 32-bit
> NUMA machines are a thing so
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks!
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
@ 2017-07-14  9:36   ` Mel Gorman
  2017-07-14 10:47     ` Michal Hocko
  2017-07-19  9:33   ` Vlastimil Babka
  1 sibling, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2017-07-14  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	Michal Hocko, linux-api

On Fri, Jul 14, 2017 at 09:59:58AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Supporting zone ordered zonelists costs us a lot of code while
> the usefulness is arguable, if existent at all. Mel has already made
> node ordering the default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fall back to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save the reclaim time much. So the only
> advantage of the zone ordering is under a light memory pressure when
> highmem requests do not ever hit into lowmem zones and the lowmem
> pressure doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use a 64b kernel on such HW, I believe we
> should rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep sysctl in place and warn if
> somebody tries to set zone ordering either from kernel command line
> or the sysctl.
> 
> Cc: <linux-api@vger.kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> index 80e4adb4c360..d9f4ea057e74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *length,
>  		loff_t *ppos)
>  {
> -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	char *str;
>  	int ret;
> -	static DEFINE_MUTEX(zl_order_mutex);
>  
> -	mutex_lock(&zl_order_mutex);
> -	if (write) {
> -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> -			ret = -EINVAL;
> -			goto out;
> -		}
> -		strcpy(saved_string, (char *)table->data);
> +	if (!write) {
> +		int len = sizeof("Default");
> +		if (copy_to_user(buffer, "Default", len))
> +			return -EFAULT;
> +		return len;
>  	}

That should be "default" because the original code would have the proc
entry display "default" unless it was set at runtime. Pretty weird I
know but it's always possible someone is parsing the original default
and not handling it properly.

Otherwise I think we're way past the point where large-memory 32-bit
NUMA machines are a thing, so

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>


* [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
@ 2017-07-14  7:59 ` Michal Hocko
  2017-07-14  9:36   ` Mel Gorman
  2017-07-19  9:33   ` Vlastimil Babka
  0 siblings, 2 replies; 22+ messages in thread
From: Michal Hocko @ 2017-07-14  7:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko, linux-api

From: Michal Hocko <mhocko@suse.com>

Supporting zone-ordered zonelists costs us a lot of code while their
usefulness is arguable, if existent at all. Mel has already made
node ordering the default on 64b systems. 32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fall back to
a different NUMA node rather than consume precious lowmem zones.

This argument is, however, weakened by the fact that memory reclaim
has been reworked to be node rather than zone oriented. This means
that lowmem requests already have to skip over all highmem pages on
LRUs, so zone ordering doesn't save much reclaim time. The only
remaining advantage of zone ordering is under light memory pressure,
when highmem requests never fall back into lowmem zones and so the
lowmem pressure never needs to reclaim.

Considering that 32b NUMA systems are rather suboptimal already, and
that it is generally advisable to run a 64b kernel on such HW, I
believe we should rather care about code maintainability and get rid
of ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn
if somebody tries to set zone ordering, either from the kernel
command line or via the sysctl.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/sysctl.c |   2 -
 mm/page_alloc.c | 178 ++++++++------------------------------------------------
 2 files changed, 23 insertions(+), 157 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 655686d546cb..0cbce40f5426 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1553,8 +1553,6 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "numa_zonelist_order",
-		.data		= &numa_zonelist_order,
-		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
 		.mode		= 0644,
 		.proc_handler	= numa_zonelist_order_handler,
 	},
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..d9f4ea057e74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4791,52 +4791,18 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	return nr_zones;
 }
 
-
-/*
- *  zonelist_order:
- *  0 = automatic detection of better ordering.
- *  1 = order by ([node] distance, -zonetype)
- *  2 = order by (-zonetype, [node] distance)
- *
- *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
- *  the same zonelist. So only NUMA can configure this param.
- */
-#define ZONELIST_ORDER_DEFAULT  0
-#define ZONELIST_ORDER_NODE     1
-#define ZONELIST_ORDER_ZONE     2
-
-/* zonelist order in the kernel.
- * set_zonelist_order() will set this to NODE or ZONE.
- */
-static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
-static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
-
-
 #ifdef CONFIG_NUMA
-/* The value user specified ....changed by config */
-static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-/* string for sysctl */
-#define NUMA_ZONELIST_ORDER_LEN	16
-char numa_zonelist_order[16] = "default";
-
-/*
- * interface for configure zonelist ordering.
- * command line option "numa_zonelist_order"
- *	= "[dD]efault	- default, automatic configuration.
- *	= "[nN]ode 	- order by node locality, then by zone within node
- *	= "[zZ]one      - order by zone, then by locality within zone
- */
 
 static int __parse_numa_zonelist_order(char *s)
 {
-	if (*s == 'd' || *s == 'D') {
-		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-	} else if (*s == 'n' || *s == 'N') {
-		user_zonelist_order = ZONELIST_ORDER_NODE;
-	} else if (*s == 'z' || *s == 'Z') {
-		user_zonelist_order = ZONELIST_ORDER_ZONE;
-	} else {
-		pr_warn("Ignoring invalid numa_zonelist_order value:  %s\n", s);
+	/*
+	 * We used to support different zonelist orders but they turned
+	 * out to be just not useful. Let's keep the warning in place
+	 * if somebody still uses the cmd line parameter so that we do
+	 * not fail silently.
+	 */
+	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
+		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
 		return -EINVAL;
 	}
 	return 0;
@@ -4844,16 +4810,10 @@ static int __parse_numa_zonelist_order(char *s)
 
 static __init int setup_numa_zonelist_order(char *s)
 {
-	int ret;
-
 	if (!s)
 		return 0;
 
-	ret = __parse_numa_zonelist_order(s);
-	if (ret == 0)
-		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
-
-	return ret;
+	return __parse_numa_zonelist_order(s);
 }
 early_param("numa_zonelist_order", setup_numa_zonelist_order);
 
@@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos)
 {
-	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	char *str;
 	int ret;
-	static DEFINE_MUTEX(zl_order_mutex);
 
-	mutex_lock(&zl_order_mutex);
-	if (write) {
-		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
-			ret = -EINVAL;
-			goto out;
-		}
-		strcpy(saved_string, (char *)table->data);
+	if (!write) {
+		int len = sizeof("Default");
+		if (copy_to_user(buffer, "Default", len))
+			return -EFAULT;
+		return len;
 	}
-	ret = proc_dostring(table, write, buffer, length, ppos);
-	if (ret)
-		goto out;
-	if (write) {
-		int oldval = user_zonelist_order;
 
-		ret = __parse_numa_zonelist_order((char *)table->data);
-		if (ret) {
-			/*
-			 * bogus value.  restore saved string
-			 */
-			strncpy((char *)table->data, saved_string,
-				NUMA_ZONELIST_ORDER_LEN);
-			user_zonelist_order = oldval;
-		} else if (oldval != user_zonelist_order) {
-			mutex_lock(&zonelists_mutex);
-			build_all_zonelists(NULL, NULL);
-			mutex_unlock(&zonelists_mutex);
-		}
-	}
-out:
-	mutex_unlock(&zl_order_mutex);
+	str = memdup_user_nul(buffer, 16);
+	if (IS_ERR(str))
+		return PTR_ERR(str);
+
+	ret = __parse_numa_zonelist_order(str);
+	kfree(str);
 	return ret;
 }
 
@@ -5006,70 +4948,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  */
 static int node_order[MAX_NUMNODES];
 
-static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
-{
-	int pos, j, node;
-	int zone_type;		/* needs to be signed */
-	struct zone *z;
-	struct zonelist *zonelist;
-
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	pos = 0;
-	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
-		for (j = 0; j < nr_nodes; j++) {
-			node = node_order[j];
-			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (managed_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
-				check_highest_zone(zone_type);
-			}
-		}
-	}
-	zonelist->_zonerefs[pos].zone = NULL;
-	zonelist->_zonerefs[pos].zone_idx = 0;
-}
-
-#if defined(CONFIG_64BIT)
-/*
- * Devices that require DMA32/DMA are relatively rare and do not justify a
- * penalty to every machine in case the specialised case applies. Default
- * to Node-ordering on 64-bit NUMA machines
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_NODE;
-}
-#else
-/*
- * On 32-bit, the Normal zone needs to be preserved for allocations accessible
- * by the kernel. If processes running on node 0 deplete the low memory zone
- * then reclaim will occur more frequency increasing stalls and potentially
- * be easier to OOM if a large percentage of the zone is under writeback or
- * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
- * Hence, default to zone ordering on 32-bit.
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_ZONE;
-}
-#endif /* CONFIG_64BIT */
-
-static void set_zonelist_order(void)
-{
-	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
-		current_zonelist_order = default_zonelist_order();
-	else
-		current_zonelist_order = user_zonelist_order;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int i, node, load;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -5099,15 +4983,7 @@ static void build_zonelists(pg_data_t *pgdat)
 
 		prev_node = node;
 		load--;
-		if (order == ZONELIST_ORDER_NODE)
-			build_zonelists_in_node_order(pgdat, node);
-		else
-			node_order[i++] = node;	/* remember order */
-	}
-
-	if (order == ZONELIST_ORDER_ZONE) {
-		/* calculate node order -- i.e., DMA last! */
-		build_zonelists_in_zone_order(pgdat, i);
+		build_zonelists_in_node_order(pgdat, node);
 	}
 
 	build_thisnode_zonelists(pgdat);
@@ -5135,11 +5011,6 @@ static void setup_min_unmapped_ratio(void);
 static void setup_min_slab_ratio(void);
 #else	/* CONFIG_NUMA */
 
-static void set_zonelist_order(void)
-{
-	current_zonelist_order = ZONELIST_ORDER_ZONE;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -5279,8 +5150,6 @@ build_all_zonelists_init(void)
  */
 void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 {
-	set_zonelist_order();
-
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
@@ -5306,9 +5175,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	pr_info("Built %i zonelists in %s order, mobility grouping %s.  Total pages: %ld\n",
+	pr_info("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
 		nr_online_nodes,
-		zonelist_order_name[current_zonelist_order],
 		page_group_by_mobility_disabled ? "off" : "on",
 		vm_total_pages);
 #ifdef CONFIG_NUMA
-- 
2.11.0



Thread overview: 22+ messages
2017-07-21 14:39 [PATCH -v1 0/9] cleanup zonelists initialization Michal Hocko
2017-07-21 14:39 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
2017-07-21 14:45   ` Michal Hocko
2017-07-21 14:39 ` [PATCH 2/9] mm, page_alloc: remove boot pageset initialization from memory hotplug Michal Hocko
2017-07-21 14:39 ` [PATCH 3/9] mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization Michal Hocko
2017-07-21 14:39 ` [PATCH 4/9] mm, memory_hotplug: drop zone from build_all_zonelists Michal Hocko
2017-07-21 14:39 ` [PATCH 5/9] mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node Michal Hocko
2017-07-21 14:39 ` [PATCH 6/9] mm, page_alloc: simplify zonelist initialization Michal Hocko
2017-07-24  9:25   ` Vlastimil Babka
2017-07-21 14:39 ` [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Michal Hocko
2017-07-21 14:39 ` [PATCH 8/9] mm, memory_hotplug: get rid of zonelists_mutex Michal Hocko
2017-07-21 14:39 ` [PATCH 9/9] mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations Michal Hocko
  -- strict thread matches above, loose matches on Subject: below --
2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
2017-07-14  9:36   ` Mel Gorman
2017-07-14 10:47     ` Michal Hocko
2017-07-14 11:16       ` Mel Gorman
2017-07-14 11:38         ` Michal Hocko
2017-07-14 12:56           ` Mel Gorman
2017-07-14 13:01             ` Mel Gorman
2017-07-14 13:08             ` Michal Hocko
2017-07-19  9:33   ` Vlastimil Babka
2017-07-19 13:44     ` Michal Hocko
