linux-api.vger.kernel.org archive mirror
* [PATCH 0/9] cleanup zonelists initialization
@ 2017-07-14  7:59 Michal Hocko
  2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-07-14  7:59 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, joonsoo kim, linux-api-u79uwXL29TY76Z2rM5mHXA,
	Michal Hocko, Shaohua Li, Toshi Kani, Wen Congyang

Hi,
this series is aimed at cleaning up the zonelists initialization code. The
primary motivation was bug report [1], which has been resolved, but the
usage of stop_machine there is just too ugly to live. Most patches are
straightforward but 3 of them need special consideration.

Patch 1 removes zone ordered zonelists completely. I am CCing linux-api
because this is a user visible change. As I argue in the patch
description, I do not think we have a strong usecase for it these days.
I have kept the sysctl in place and warn in the log if somebody tries to
configure zone ordering. If somebody has a real usecase for it we can
revert this patch, but I do not expect anybody will actually notice any
runtime difference. This patch is not strictly needed for the rest but
it made patch 6 easier to implement.
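
For reference, the difference between the two orderings on a hypothetical
two node 32b machine with DMA, Normal and HighMem zones (node 0 fallback
zonelist; an illustration only, not dumped from any real system):

  node order (kept):    N0_HighMem N0_Normal N0_DMA N1_HighMem N1_Normal N1_DMA
  zone order (removed): N0_HighMem N1_HighMem N0_Normal N1_Normal N0_DMA N1_DMA

Roughly speaking, with node ordering a GFP_KERNEL allocation on node 0
falls back to the local DMA zone before the remote Normal zone, which is
exactly what zone ordering tried to avoid at the price of remote accesses.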

Patch 7 removes stop_machine from build_all_zonelists without adding any
special synchronization between iterators and the updater, which I
_believe_ is acceptable as explained in the changelog. I hope I am not
missing anything.

Patch 8 then removes zonelists_mutex, which is kind of ugly as well and
not really needed AFAICS, but care should be taken when double checking
my thinking.

This has passed my light testing but I currently do not have HW to
test the hotadd_new_pgdat path (aka a completely new node added to the
system at runtime).

This is based on the current mmotm git tree (mmotm-2017-07-12-15-11).
Any feedback is highly appreciated.

The diffstat looks really promising:
 include/linux/mmzone.h |   3 +-
 init/main.c            |   2 +-
 kernel/sysctl.c        |   2 -
 mm/internal.h          |   1 +
 mm/memory_hotplug.c    |  27 +----
 mm/page_alloc.c        | 293 ++++++++++++-------------------------------------
 mm/page_ext.c          |   5 +-
 mm/sparse-vmemmap.c    |  11 +-
 mm/sparse.c            |  10 +-
 9 files changed, 89 insertions(+), 265 deletions(-)

Shortlog says
Michal Hocko (9):
      mm, page_alloc: rip out ZONELIST_ORDER_ZONE
      mm, page_alloc: remove boot pageset initialization from memory hotplug
      mm, page_alloc: do not set_cpu_numa_mem on empty nodes initialization
      mm, memory_hotplug: drop zone from build_all_zonelists
      mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node
      mm, page_alloc: simplify zonelist initialization
      mm, page_alloc: remove stop_machine from build_all_zonelists
      mm, memory_hotplug: get rid of zonelists_mutex
      mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations

[1] http://lkml.kernel.org/r/alpine.DEB.2.20.1706291803380.1861@nanos

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
@ 2017-07-14  7:59 ` Michal Hocko
       [not found]   ` <20170714080006.7250-2-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-07-14  7:59 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Vlastimil Babka,
	LKML, Michal Hocko, linux-api

From: Michal Hocko <mhocko@suse.com>

Supporting zone ordered zonelists costs us a lot of code while their
usefulness is arguable, if existent at all. Mel has already made node
ordering the default on 64b systems. 32b systems are still using
ZONELIST_ORDER_ZONE because it is considered better to fall back to
a different NUMA node rather than consume precious lowmem zones.

This argument is, however, weakened by the fact that the memory reclaim
has been reworked to be node rather than zone oriented. This means
that lowmem requests already have to skip over all highmem pages on LRUs
and so zone ordering doesn't save much reclaim time. So the only
advantage of zone ordering is under light memory pressure, when
highmem requests never hit lowmem zones and the lowmem pressure
doesn't need to reclaim.

Considering that 32b NUMA systems are rather suboptimal already and
it is generally advisable to use a 64b kernel on such HW, I believe we
should rather care about code maintainability and just get rid of
ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
somebody tries to set zone ordering either from the kernel command line
or the sysctl.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 kernel/sysctl.c |   2 -
 mm/page_alloc.c | 178 ++++++++------------------------------------------------
 2 files changed, 23 insertions(+), 157 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 655686d546cb..0cbce40f5426 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1553,8 +1553,6 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
 	{
 		.procname	= "numa_zonelist_order",
-		.data		= &numa_zonelist_order,
-		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
 		.mode		= 0644,
 		.proc_handler	= numa_zonelist_order_handler,
 	},
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 80e4adb4c360..d9f4ea057e74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4791,52 +4791,18 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	return nr_zones;
 }
 
-
-/*
- *  zonelist_order:
- *  0 = automatic detection of better ordering.
- *  1 = order by ([node] distance, -zonetype)
- *  2 = order by (-zonetype, [node] distance)
- *
- *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
- *  the same zonelist. So only NUMA can configure this param.
- */
-#define ZONELIST_ORDER_DEFAULT  0
-#define ZONELIST_ORDER_NODE     1
-#define ZONELIST_ORDER_ZONE     2
-
-/* zonelist order in the kernel.
- * set_zonelist_order() will set this to NODE or ZONE.
- */
-static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
-static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
-
-
 #ifdef CONFIG_NUMA
-/* The value user specified ....changed by config */
-static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-/* string for sysctl */
-#define NUMA_ZONELIST_ORDER_LEN	16
-char numa_zonelist_order[16] = "default";
-
-/*
- * interface for configure zonelist ordering.
- * command line option "numa_zonelist_order"
- *	= "[dD]efault	- default, automatic configuration.
- *	= "[nN]ode 	- order by node locality, then by zone within node
- *	= "[zZ]one      - order by zone, then by locality within zone
- */
 
 static int __parse_numa_zonelist_order(char *s)
 {
-	if (*s == 'd' || *s == 'D') {
-		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
-	} else if (*s == 'n' || *s == 'N') {
-		user_zonelist_order = ZONELIST_ORDER_NODE;
-	} else if (*s == 'z' || *s == 'Z') {
-		user_zonelist_order = ZONELIST_ORDER_ZONE;
-	} else {
-		pr_warn("Ignoring invalid numa_zonelist_order value:  %s\n", s);
+	/*
+	 * We used to support different zonelists modes but they turned
+	 * out to be just not useful. Let's keep the warning in place
+	 * if somebody still uses the cmd line parameter so that we do
+	 * not fail it silently
+	 */
+	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
+		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
 		return -EINVAL;
 	}
 	return 0;
@@ -4844,16 +4810,10 @@ static int __parse_numa_zonelist_order(char *s)
 
 static __init int setup_numa_zonelist_order(char *s)
 {
-	int ret;
-
 	if (!s)
 		return 0;
 
-	ret = __parse_numa_zonelist_order(s);
-	if (ret == 0)
-		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
-
-	return ret;
+	return __parse_numa_zonelist_order(s);
 }
 early_param("numa_zonelist_order", setup_numa_zonelist_order);
 
@@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *length,
 		loff_t *ppos)
 {
-	char saved_string[NUMA_ZONELIST_ORDER_LEN];
+	char *str;
 	int ret;
-	static DEFINE_MUTEX(zl_order_mutex);
 
-	mutex_lock(&zl_order_mutex);
-	if (write) {
-		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
-			ret = -EINVAL;
-			goto out;
-		}
-		strcpy(saved_string, (char *)table->data);
+	if (!write) {
+		int len = sizeof("Default");
+		if (copy_to_user(buffer, "Default", len))
+			return -EFAULT;
+		return len;
 	}
-	ret = proc_dostring(table, write, buffer, length, ppos);
-	if (ret)
-		goto out;
-	if (write) {
-		int oldval = user_zonelist_order;
 
-		ret = __parse_numa_zonelist_order((char *)table->data);
-		if (ret) {
-			/*
-			 * bogus value.  restore saved string
-			 */
-			strncpy((char *)table->data, saved_string,
-				NUMA_ZONELIST_ORDER_LEN);
-			user_zonelist_order = oldval;
-		} else if (oldval != user_zonelist_order) {
-			mutex_lock(&zonelists_mutex);
-			build_all_zonelists(NULL, NULL);
-			mutex_unlock(&zonelists_mutex);
-		}
-	}
-out:
-	mutex_unlock(&zl_order_mutex);
+	str = memdup_user_nul(buffer, 16);
+	if (IS_ERR(str))
+		return PTR_ERR(str);
+
+	ret = __parse_numa_zonelist_order(str);
+	kfree(str);
 	return ret;
 }
 
@@ -5006,70 +4948,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
  */
 static int node_order[MAX_NUMNODES];
 
-static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
-{
-	int pos, j, node;
-	int zone_type;		/* needs to be signed */
-	struct zone *z;
-	struct zonelist *zonelist;
-
-	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
-	pos = 0;
-	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
-		for (j = 0; j < nr_nodes; j++) {
-			node = node_order[j];
-			z = &NODE_DATA(node)->node_zones[zone_type];
-			if (managed_zone(z)) {
-				zoneref_set_zone(z,
-					&zonelist->_zonerefs[pos++]);
-				check_highest_zone(zone_type);
-			}
-		}
-	}
-	zonelist->_zonerefs[pos].zone = NULL;
-	zonelist->_zonerefs[pos].zone_idx = 0;
-}
-
-#if defined(CONFIG_64BIT)
-/*
- * Devices that require DMA32/DMA are relatively rare and do not justify a
- * penalty to every machine in case the specialised case applies. Default
- * to Node-ordering on 64-bit NUMA machines
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_NODE;
-}
-#else
-/*
- * On 32-bit, the Normal zone needs to be preserved for allocations accessible
- * by the kernel. If processes running on node 0 deplete the low memory zone
- * then reclaim will occur more frequency increasing stalls and potentially
- * be easier to OOM if a large percentage of the zone is under writeback or
- * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
- * Hence, default to zone ordering on 32-bit.
- */
-static int default_zonelist_order(void)
-{
-	return ZONELIST_ORDER_ZONE;
-}
-#endif /* CONFIG_64BIT */
-
-static void set_zonelist_order(void)
-{
-	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
-		current_zonelist_order = default_zonelist_order();
-	else
-		current_zonelist_order = user_zonelist_order;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int i, node, load;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
-	unsigned int order = current_zonelist_order;
 
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
@@ -5099,15 +4983,7 @@ static void build_zonelists(pg_data_t *pgdat)
 
 		prev_node = node;
 		load--;
-		if (order == ZONELIST_ORDER_NODE)
-			build_zonelists_in_node_order(pgdat, node);
-		else
-			node_order[i++] = node;	/* remember order */
-	}
-
-	if (order == ZONELIST_ORDER_ZONE) {
-		/* calculate node order -- i.e., DMA last! */
-		build_zonelists_in_zone_order(pgdat, i);
+		build_zonelists_in_node_order(pgdat, node);
 	}
 
 	build_thisnode_zonelists(pgdat);
@@ -5135,11 +5011,6 @@ static void setup_min_unmapped_ratio(void);
 static void setup_min_slab_ratio(void);
 #else	/* CONFIG_NUMA */
 
-static void set_zonelist_order(void)
-{
-	current_zonelist_order = ZONELIST_ORDER_ZONE;
-}
-
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
@@ -5279,8 +5150,6 @@ build_all_zonelists_init(void)
  */
 void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 {
-	set_zonelist_order();
-
 	if (system_state == SYSTEM_BOOTING) {
 		build_all_zonelists_init();
 	} else {
@@ -5306,9 +5175,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
 	else
 		page_group_by_mobility_disabled = 0;
 
-	pr_info("Built %i zonelists in %s order, mobility grouping %s.  Total pages: %ld\n",
+	pr_info("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
 		nr_online_nodes,
-		zonelist_order_name[current_zonelist_order],
 		page_group_by_mobility_disabled ? "off" : "on",
 		vm_total_pages);
 #ifdef CONFIG_NUMA
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
       [not found]   ` <20170714080006.7250-2-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2017-07-14  9:36     ` Mel Gorman
  2017-07-14 10:47       ` Michal Hocko
  2017-07-19  9:33     ` Vlastimil Babka
  1 sibling, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2017-07-14  9:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, LKML, Michal Hocko,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On Fri, Jul 14, 2017 at 09:59:58AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> 
> Supporting zone ordered zonelists costs us just a lot of code while
> the usefulness is arguable if existent at all. Mel has already made
> node ordering default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fallback to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save the reclaim time much. So the only
> advantage of the zone ordering is under a light memory pressure when
> highmem requests do not ever hit into lowmem zones and the lowmem
> pressure doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use 64b kernel on such a HW I believe we
> should rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> somebody tries to set zone ordering either from kernel command line
> or the sysctl.
> 
> Cc: <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> index 80e4adb4c360..d9f4ea057e74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *length,
>  		loff_t *ppos)
>  {
> -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	char *str;
>  	int ret;
> -	static DEFINE_MUTEX(zl_order_mutex);
>  
> -	mutex_lock(&zl_order_mutex);
> -	if (write) {
> -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> -			ret = -EINVAL;
> -			goto out;
> -		}
> -		strcpy(saved_string, (char *)table->data);
> +	if (!write) {
> +		int len = sizeof("Default");
> +		if (copy_to_user(buffer, "Default", len))
> +			return -EFAULT;
> +		return len;
>  	}

That should be "default" because the original code would have the proc
entry display "default" unless it was set at runtime. Pretty weird, I
know, but it's always possible someone is parsing the original default
and not handling it properly.

Otherwise I think we're way past the point where large memory 32-bit
NUMA machines are a thing so

Acked-by: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14  9:36     ` Mel Gorman
@ 2017-07-14 10:47       ` Michal Hocko
       [not found]         ` <20170714104756.GD2618-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-07-14 10:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 10:36:50, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 09:59:58AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Supporting zone ordered zonelists costs us just a lot of code while
> > the usefulness is arguable if existent at all. Mel has already made
> > node ordering default on 64b systems. 32b systems are still using
> > ZONELIST_ORDER_ZONE because it is considered better to fallback to
> > a different NUMA node rather than consume precious lowmem zones.
> > 
> > This argument is, however, weakened by the fact that the memory reclaim
> > has been reworked to be node rather than zone oriented. This means
> > that lowmem requests have to skip over all highmem pages on LRUs already
> > and so zone ordering doesn't save the reclaim time much. So the only
> > advantage of the zone ordering is under a light memory pressure when
> > highmem requests do not ever hit into lowmem zones and the lowmem
> > pressure doesn't need to reclaim.
> > 
> > Considering that 32b NUMA systems are rather suboptimal already and
> > it is generally advisable to use 64b kernel on such a HW I believe we
> > should rather care about the code maintainability and just get rid of
> > ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> > somebody tries to set zone ordering either from kernel command line
> > or the sysctl.
> > 
> > Cc: <linux-api@vger.kernel.org>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > index 80e4adb4c360..d9f4ea057e74 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
> >  		void __user *buffer, size_t *length,
> >  		loff_t *ppos)
> >  {
> > -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> > +	char *str;
> >  	int ret;
> > -	static DEFINE_MUTEX(zl_order_mutex);
> >  
> > -	mutex_lock(&zl_order_mutex);
> > -	if (write) {
> > -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> > -			ret = -EINVAL;
> > -			goto out;
> > -		}
> > -		strcpy(saved_string, (char *)table->data);
> > +	if (!write) {
> > +		int len = sizeof("Default");
> > +		if (copy_to_user(buffer, "Default", len))
> > +			return -EFAULT;
> > +		return len;
> >  	}
> 
> That should be "default" because the original code would have the proc
> entry display "default" unless it was set at runtime. Pretty weird I
> know but it's always possible someone is parsing the original default
> and not handling it properly.

Ohh, right! That is indeed strange. Then I guess it would probably be
better to simply return Node to make it clear what the default is. What
do you think?

> Otherwise I think we're way past the point where large memory 32-bit
> NUMA machines are a thing so
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
       [not found]         ` <20170714104756.GD2618-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2017-07-14 11:16           ` Mel Gorman
  2017-07-14 11:38             ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2017-07-14 11:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, LKML, linux-api-u79uwXL29TY76Z2rM5mHXA

On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > That should be "default" because the original code would have the proc
> > entry display "default" unless it was set at runtime. Pretty weird I
> > know but it's always possible someone is parsing the original default
> > and not handling it properly.
> 
> Ohh, right! That is indeed strange. Then I guess it would be probably
> better to simply return Node to make it clear what the default is. What
> do you think?
> 

That would work too. The casing still matches.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 11:16           ` Mel Gorman
@ 2017-07-14 11:38             ` Michal Hocko
  2017-07-14 12:56               ` Mel Gorman
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2017-07-14 11:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 12:16:33, Mel Gorman wrote:
> On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > > That should be "default" because the original code would have the proc
> > > entry display "default" unless it was set at runtime. Pretty weird I
> > > know but it's always possible someone is parsing the original default
> > > and not handling it properly.
> > 
> > Ohh, right! That is indeed strange. Then I guess it would be probably
> > better to simply return Node to make it clear what the default is. What
> > do you think?
> > 
> 
> That would work too. The casing still matches.

This folded in?
---
>From c7c36f011590680b254813be00ed791ddbc1bf1c Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 14 Jul 2017 13:36:05 +0200
Subject: [PATCH] fold me "mm, page_alloc: rip out ZONELIST_ORDER_ZONE"

- do not print Default in the sysctl handler because our behavior was rather
inconsistent in the past: numa_zonelist_order was lowercase while
zonelist_order_name was uppercase, so a boot time unchanged value would
print lowercase while an updated value could be uppercase. Print "Node",
which is the default, instead - Mel

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd4c96edcec3..49bade7ff049 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4828,8 +4828,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
 	int ret;
 
 	if (!write) {
-		int len = sizeof("Default");
-		if (copy_to_user(buffer, "Default", len))
+		int len = sizeof("Node");
+		if (copy_to_user(buffer, "Node", len))
 			return -EFAULT;
 		return len;
 	}
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 11:38             ` Michal Hocko
@ 2017-07-14 12:56               ` Mel Gorman
       [not found]                 ` <20170714125616.clbp4ezgtoon6cmk-l3A5Bk7waGM@public.gmane.org>
  2017-07-14 13:08                 ` Michal Hocko
  0 siblings, 2 replies; 12+ messages in thread
From: Mel Gorman @ 2017-07-14 12:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri, Jul 14, 2017 at 01:38:40PM +0200, Michal Hocko wrote:
> On Fri 14-07-17 12:16:33, Mel Gorman wrote:
> > On Fri, Jul 14, 2017 at 12:47:57PM +0200, Michal Hocko wrote:
> > > > That should be "default" because the original code would have the proc
> > > > entry display "default" unless it was set at runtime. Pretty weird I
> > > > know but it's always possible someone is parsing the original default
> > > > and not handling it properly.
> > > 
> > > Ohh, right! That is indeed strange. Then I guess it would be probably
> > > better to simply return Node to make it clear what the default is. What
> > > do you think?
> > > 
> > 
> > That would work too. The casing still matches.
> 
> This folded in?
> ---
> From c7c36f011590680b254813be00ed791ddbc1bf1c Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 14 Jul 2017 13:36:05 +0200
> Subject: [PATCH] fold me "mm, page_alloc: rip out ZONELIST_ORDER_ZONE"
> 
> - do not print Default in sysctl handler because our behavior was rather
> inconsistent in the past: numa_zonelist_order was lowercase while
> zonelist_order_name was uppercase, so a boot time unchanged value would
> print lowercase while updated value could be uppercase. Print "Node"
> which is the default instead - Mel
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd4c96edcec3..49bade7ff049 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4828,8 +4828,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  	int ret;
>  
>  	if (!write) {
> -		int len = sizeof("Default");
> -		if (copy_to_user(buffer, "Default", len))
> +		int len = sizeof("Node");
> +		if (copy_to_user(buffer, "Node", len))
>  			return -EFAULT;

Ok for the name. But what's with using sizeof? The type is char * so it
just happens to work for Default, but not for Node. Also strongly suggest
you continue using proc_dostring because it catches all the corner-cases
that can occur.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
       [not found]                 ` <20170714125616.clbp4ezgtoon6cmk-l3A5Bk7waGM@public.gmane.org>
@ 2017-07-14 13:01                   ` Mel Gorman
  0 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2017-07-14 13:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, LKML, linux-api-u79uwXL29TY76Z2rM5mHXA

On Fri, Jul 14, 2017 at 01:56:16PM +0100, Mel Gorman wrote:
> >  	if (!write) {
> > -		int len = sizeof("Default");
> > -		if (copy_to_user(buffer, "Default", len))
> > +		int len = sizeof("Node");
> > +		if (copy_to_user(buffer, "Node", len))
> >  			return -EFAULT;
> 
> Ok for the name. But what's with using sizeof?

Bah, sizeof static compile-time string versus char *. Never mind.
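
For the record, the semantics behind the confusion, as a minimal user
space sketch (illustrative only, not kernel code):

	#include <stdio.h>

	int main(void)
	{
		const char *p = "Node";

		/* a string literal is an array, here char[5]: 'N' 'o' 'd' 'e' '\0' */
		printf("%zu\n", sizeof("Node"));	/* 5 */
		printf("%zu\n", sizeof("Default"));	/* 8 */
		/* a pointer has pointer size regardless of what it points to */
		printf("%zu\n", sizeof(p));		/* 8 on 64b, 4 on 32b */
		return 0;
	}

So sizeof applied to the literal includes the terminating NUL that
copy_to_user() transfers, which is why the construct works for both
"Default" and "Node"; sizeof("Default") merely happens to match the
64b pointer size.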

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
  2017-07-14 12:56               ` Mel Gorman
       [not found]                 ` <20170714125616.clbp4ezgtoon6cmk-l3A5Bk7waGM@public.gmane.org>
@ 2017-07-14 13:08                 ` Michal Hocko
  1 sibling, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-07-14 13:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, Johannes Weiner, Vlastimil Babka, LKML,
	linux-api

On Fri 14-07-17 13:56:16, Mel Gorman wrote:
> Also strongly suggest you continue using proc_dostring because it
> catches all the corner-cases that can occur.

proc_dostring would need a data field in the sysctl table and I found that
more confusing than having a simplistic read/write. I can do that if you
insist though.
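
For completeness, keeping proc_dostring would mean carrying a writable
string buffer again, roughly along the lines of the code this patch
removes (a sketch based on the removed hunks, not a tested alternative):

	/* mm/page_alloc.c */
	#define NUMA_ZONELIST_ORDER_LEN	16
	char numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN] = "Node";

	/* kernel/sysctl.c, the vm_table[] entry */
	{
		.procname	= "numa_zonelist_order",
		.data		= &numa_zonelist_order,
		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
		.mode		= 0644,
		.proc_handler	= numa_zonelist_order_handler,
	},

The handler would then call proc_dostring() and parse table->data on
writes, as the old code did, at the price of keeping that buffer around
just to reject everything but Node.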
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
       [not found]   ` <20170714080006.7250-2-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
  2017-07-14  9:36     ` Mel Gorman
@ 2017-07-19  9:33     ` Vlastimil Babka
       [not found]       ` <a4490c3e-9f7b-72b2-dfa3-80c054df6600-AlSwsSmVLrQ@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2017-07-19  9:33 UTC (permalink / raw)
  To: Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, LKML, Michal Hocko,
	linux-api-u79uwXL29TY76Z2rM5mHXA

On 07/14/2017 09:59 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> 
> Supporting zone ordered zonelists costs us just a lot of code while
> the usefulness is arguable if existent at all. Mel has already made
> node ordering default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fallback to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save the reclaim time much. So the only
> advantage of the zone ordering is under a light memory pressure when
> highmem requests do not ever hit into lowmem zones and the lowmem
> pressure doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use 64b kernel on such a HW I believe we
> should rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> somebody tries to set zone ordering either from kernel command line
> or the sysctl.
> 
> Cc: <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Found some leftovers to clean up:

include/linux/mmzone.h:
extern char numa_zonelist_order[];
#define NUMA_ZONELIST_ORDER_LEN 16      /* string buffer size */

Also update docs?
Documentation/sysctl/vm.txt:zone.  Specify "[Zz]one" for zone order.
Documentation/admin-guide/kernel-parameters.txt:
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
Documentation/vm/numa:a default zonelist order based on the sizes of the
various zone types relative
Documentation/vm/numa:default zonelist order may be overridden using the
numa_zonelist_order kernel

Otherwise,
Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
       [not found]       ` <a4490c3e-9f7b-72b2-dfa3-80c054df6600-AlSwsSmVLrQ@public.gmane.org>
@ 2017-07-19 13:44         ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-07-19 13:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrew Morton, Mel Gorman,
	Johannes Weiner, LKML, linux-api-u79uwXL29TY76Z2rM5mHXA

On Wed 19-07-17 11:33:49, Vlastimil Babka wrote:
> On 07/14/2017 09:59 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> > 
> > Supporting zone ordered zonelists costs us just a lot of code while
> > the usefulness is arguable if existent at all. Mel has already made
> > node ordering default on 64b systems. 32b systems are still using
> > ZONELIST_ORDER_ZONE because it is considered better to fallback to
> > a different NUMA node rather than consume precious lowmem zones.
> > 
> > This argument is, however, weakened by the fact that the memory reclaim
> > has been reworked to be node rather than zone oriented. This means
> > that lowmem requests have to skip over all highmem pages on LRUs already
> > and so zone ordering doesn't save the reclaim time much. So the only
> > advantage of the zone ordering is under a light memory pressure when
> > highmem requests do not ever hit into lowmem zones and the lowmem
> > pressure doesn't need to reclaim.
> > 
> > Considering that 32b NUMA systems are rather suboptimal already and
> > it is generally advisable to use 64b kernel on such a HW I believe we
> > should rather care about the code maintainability and just get rid of
> > ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> > somebody tries to set zone ordering either from kernel command line
> > or the sysctl.
> > 
> > Cc: <linux-api-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> > Signed-off-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
> 
> Found some leftovers to cleanup:
> 
> include/linux/mmzone.h:
> extern char numa_zonelist_order[];
> #define NUMA_ZONELIST_ORDER_LEN 16      /* string buffer size */
> 
> Also update docs?
> Documentation/sysctl/vm.txt:zone.  Specify "[Zz]one" for zone order.
> Documentation/admin-guide/kernel-parameters.txt:
> numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> Documentation/vm/numa:a default zonelist order based on the sizes of the
> various zone types relative
> Documentation/vm/numa:default zonelist order may be overridden using the
> numa_zonelist_order kernel
> 
> Otherwise,
> Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>

This?
---
commit 06c69d6785447160717ca8cc476dc9cac7a3f964
Author: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>
Date:   Wed Jul 19 14:28:31 2017 +0200

    fold me
    
    - update documentation as per Vlastimil
    
    Acked-by: Vlastimil Babka <vbabka-AlSwsSmVLrQ@public.gmane.org>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 60530c8490ff..28f1a0f84456 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2724,7 +2724,7 @@
 			Allowed values are enable and disable
 
 	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
-			one of ['zone', 'node', 'default'] can be specified
+			'node', 'default' can be specified
 			This can be set from sysctl after boot.
 			See Documentation/sysctl/vm.txt for details.
 
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 48244c42ff52..9baf66a9ef4e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
 
 numa_zonelist_order
 
-This sysctl is only for NUMA.
+This sysctl is only for NUMA and it is deprecated. Anything but
+Node order will fail!
+
 'where the memory is allocated from' is controlled by zonelists.
 (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
  you may be able to read ZONE_DMA as ZONE_DMA32...)
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index a08f71647714..a31b85b9bb88 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
 fall back to the same zone type on a different node, or to a different zone
 type on the same node.  This is an important consideration because some zones,
 such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
-a default zonelist order based on the sizes of the various zone types relative
-to the total memory of the node and the total memory of the system.  The
-default zonelist order may be overridden using the numa_zonelist_order kernel
-boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
-Documentation/sysctl/vm.txt]
+a default Node ordered zonelist. This means it tries to fallback to other zones
+from the same node before using remote nodes which are ordered by NUMA distance.
 
 By default, Linux will attempt to satisfy memory allocation requests from the
 node to which the CPU that executes the request is assigned.  Specifically,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc14b8b3f6ce..b849006b20d3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -895,8 +895,6 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
-extern char numa_zonelist_order[];
-#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE
       [not found] ` <20170721143915.14161-2-mhocko@kernel.org>
@ 2017-07-21 14:45   ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-07-21 14:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, linux-mm, LKML, linux-api

Oops, linux-api didn't make it into the CC list. Let's add it here.
The cover letter, which will give you some context, is here:
http://lkml.kernel.org/r/20170721143915.14161-1-mhocko@kernel.org

On Fri 21-07-17 16:39:07, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Supporting zone ordered zonelists costs us just a lot of code while
> the usefulness is arguable if existent at all. Mel has already made
> node ordering default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fallback to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weakened by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save the reclaim time much. So the only
> advantage of the zone ordering is under a light memory pressure when
> highmem requests do not ever hit into lowmem zones and the lowmem
> pressure doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use 64b kernel on such a HW I believe we
> should rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep the sysctl in place and warn if
> somebody tries to set zone ordering either from kernel command line
> or the sysctl.
> 
> Changes since v1
> - do not print Default in sysctl handler because our behavior was rather
>   inconsistent in the past: numa_zonelist_order was lowercase while
>   zonelist_order_name was uppercase, so a boot time unchanged value would
>   print lowercase while updated value could be uppercase. Print "Node"
>   which is the default instead - Mel
> - update documentation - Vlastimil
> 
> Cc: linux-api@vger.kernel.org
> Acked-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   2 +-
>  Documentation/sysctl/vm.txt                     |   4 +-
>  Documentation/vm/numa                           |   7 +-
>  include/linux/mmzone.h                          |   2 -
>  kernel/sysctl.c                                 |   2 -
>  mm/page_alloc.c                                 | 178 +++---------------------
>  6 files changed, 29 insertions(+), 166 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 60530c8490ff..28f1a0f84456 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2724,7 +2724,7 @@
>  			Allowed values are enable and disable
>  
>  	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> -			one of ['zone', 'node', 'default'] can be specified
> +			'node', 'default' can be specified
>  			This can be set from sysctl after boot.
>  			See Documentation/sysctl/vm.txt for details.
>  
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 48244c42ff52..9baf66a9ef4e 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
>  
>  numa_zonelist_order
>  
> -This sysctl is only for NUMA.
> +This sysctl is only for NUMA and it is deprecated. Anything but
> +Node order will fail!
> +
>  'where the memory is allocated from' is controlled by zonelists.
>  (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
>   you may be able to read ZONE_DMA as ZONE_DMA32...)
> diff --git a/Documentation/vm/numa b/Documentation/vm/numa
> index a08f71647714..a31b85b9bb88 100644
> --- a/Documentation/vm/numa
> +++ b/Documentation/vm/numa
> @@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
>  fall back to the same zone type on a different node, or to a different zone
>  type on the same node.  This is an important consideration because some zones,
>  such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
> -a default zonelist order based on the sizes of the various zone types relative
> -to the total memory of the node and the total memory of the system.  The
> -default zonelist order may be overridden using the numa_zonelist_order kernel
> -boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
> -Documentation/sysctl/vm.txt]
> +a default Node ordered zonelist. This means it tries to fallback to other zones
> +from the same node before using remote nodes which are ordered by NUMA distance.
>  
>  By default, Linux will attempt to satisfy memory allocation requests from the
>  node to which the CPU that executes the request is assigned.  Specifically,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fc14b8b3f6ce..b849006b20d3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -895,8 +895,6 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
>  
>  extern int numa_zonelist_order_handler(struct ctl_table *, int,
>  			void __user *, size_t *, loff_t *);
> -extern char numa_zonelist_order[];
> -#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
>  
>  #ifndef CONFIG_NEED_MULTIPLE_NODES
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 655686d546cb..0cbce40f5426 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1553,8 +1553,6 @@ static struct ctl_table vm_table[] = {
>  #ifdef CONFIG_NUMA
>  	{
>  		.procname	= "numa_zonelist_order",
> -		.data		= &numa_zonelist_order,
> -		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
>  		.mode		= 0644,
>  		.proc_handler	= numa_zonelist_order_handler,
>  	},
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb4c360..b839a90e24c8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4791,52 +4791,18 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
>  	return nr_zones;
>  }
>  
> -
> -/*
> - *  zonelist_order:
> - *  0 = automatic detection of better ordering.
> - *  1 = order by ([node] distance, -zonetype)
> - *  2 = order by (-zonetype, [node] distance)
> - *
> - *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
> - *  the same zonelist. So only NUMA can configure this param.
> - */
> -#define ZONELIST_ORDER_DEFAULT  0
> -#define ZONELIST_ORDER_NODE     1
> -#define ZONELIST_ORDER_ZONE     2
> -
> -/* zonelist order in the kernel.
> - * set_zonelist_order() will set this to NODE or ZONE.
> - */
> -static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
> -
> -
>  #ifdef CONFIG_NUMA
> -/* The value user specified ....changed by config */
> -static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -/* string for sysctl */
> -#define NUMA_ZONELIST_ORDER_LEN	16
> -char numa_zonelist_order[16] = "default";
> -
> -/*
> - * interface for configure zonelist ordering.
> - * command line option "numa_zonelist_order"
> - *	= "[dD]efault	- default, automatic configuration.
> - *	= "[nN]ode 	- order by node locality, then by zone within node
> - *	= "[zZ]one      - order by zone, then by locality within zone
> - */
>  
>  static int __parse_numa_zonelist_order(char *s)
>  {
> -	if (*s == 'd' || *s == 'D') {
> -		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -	} else if (*s == 'n' || *s == 'N') {
> -		user_zonelist_order = ZONELIST_ORDER_NODE;
> -	} else if (*s == 'z' || *s == 'Z') {
> -		user_zonelist_order = ZONELIST_ORDER_ZONE;
> -	} else {
> -		pr_warn("Ignoring invalid numa_zonelist_order value:  %s\n", s);
> +	/*
> +	 * We used to support different zonelists modes but they turned
> +	 * out to be just not useful. Let's keep the warning in place
> +	 * if somebody still uses the cmd line parameter so that we do
> +	 * not fail it silently
> +	 */
> +	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
> +		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
>  		return -EINVAL;
>  	}
>  	return 0;
> @@ -4844,16 +4810,10 @@ static int __parse_numa_zonelist_order(char *s)
>  
>  static __init int setup_numa_zonelist_order(char *s)
>  {
> -	int ret;
> -
>  	if (!s)
>  		return 0;
>  
> -	ret = __parse_numa_zonelist_order(s);
> -	if (ret == 0)
> -		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
> -
> -	return ret;
> +	return __parse_numa_zonelist_order(s);
>  }
>  early_param("numa_zonelist_order", setup_numa_zonelist_order);
>  
> @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *length,
>  		loff_t *ppos)
>  {
> -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	char *str;
>  	int ret;
> -	static DEFINE_MUTEX(zl_order_mutex);
>  
> -	mutex_lock(&zl_order_mutex);
> -	if (write) {
> -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> -			ret = -EINVAL;
> -			goto out;
> -		}
> -		strcpy(saved_string, (char *)table->data);
> +	if (!write) {
> +		int len = sizeof("Node");
> +		if (copy_to_user(buffer, "Node", len))
> +			return -EFAULT;
> +		return len;
>  	}
> -	ret = proc_dostring(table, write, buffer, length, ppos);
> -	if (ret)
> -		goto out;
> -	if (write) {
> -		int oldval = user_zonelist_order;
>  
> -		ret = __parse_numa_zonelist_order((char *)table->data);
> -		if (ret) {
> -			/*
> -			 * bogus value.  restore saved string
> -			 */
> -			strncpy((char *)table->data, saved_string,
> -				NUMA_ZONELIST_ORDER_LEN);
> -			user_zonelist_order = oldval;
> -		} else if (oldval != user_zonelist_order) {
> -			mutex_lock(&zonelists_mutex);
> -			build_all_zonelists(NULL, NULL);
> -			mutex_unlock(&zonelists_mutex);
> -		}
> -	}
> -out:
> -	mutex_unlock(&zl_order_mutex);
> +	str = memdup_user_nul(buffer, 16);
> +	if (IS_ERR(str))
> +		return PTR_ERR(str);
> +
> +	ret = __parse_numa_zonelist_order(str);
> +	kfree(str);
>  	return ret;
>  }
>  
> @@ -5006,70 +4948,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
>   */
>  static int node_order[MAX_NUMNODES];
>  
> -static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
> -{
> -	int pos, j, node;
> -	int zone_type;		/* needs to be signed */
> -	struct zone *z;
> -	struct zonelist *zonelist;
> -
> -	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
> -	pos = 0;
> -	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
> -		for (j = 0; j < nr_nodes; j++) {
> -			node = node_order[j];
> -			z = &NODE_DATA(node)->node_zones[zone_type];
> -			if (managed_zone(z)) {
> -				zoneref_set_zone(z,
> -					&zonelist->_zonerefs[pos++]);
> -				check_highest_zone(zone_type);
> -			}
> -		}
> -	}
> -	zonelist->_zonerefs[pos].zone = NULL;
> -	zonelist->_zonerefs[pos].zone_idx = 0;
> -}
> -
> -#if defined(CONFIG_64BIT)
> -/*
> - * Devices that require DMA32/DMA are relatively rare and do not justify a
> - * penalty to every machine in case the specialised case applies. Default
> - * to Node-ordering on 64-bit NUMA machines
> - */
> -static int default_zonelist_order(void)
> -{
> -	return ZONELIST_ORDER_NODE;
> -}
> -#else
> -/*
> - * On 32-bit, the Normal zone needs to be preserved for allocations accessible
> - * by the kernel. If processes running on node 0 deplete the low memory zone
> - * then reclaim will occur more frequency increasing stalls and potentially
> - * be easier to OOM if a large percentage of the zone is under writeback or
> - * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
> - * Hence, default to zone ordering on 32-bit.
> - */
> -static int default_zonelist_order(void)
> -{
> -	return ZONELIST_ORDER_ZONE;
> -}
> -#endif /* CONFIG_64BIT */
> -
> -static void set_zonelist_order(void)
> -{
> -	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
> -		current_zonelist_order = default_zonelist_order();
> -	else
> -		current_zonelist_order = user_zonelist_order;
> -}
> -
>  static void build_zonelists(pg_data_t *pgdat)
>  {
>  	int i, node, load;
>  	nodemask_t used_mask;
>  	int local_node, prev_node;
>  	struct zonelist *zonelist;
> -	unsigned int order = current_zonelist_order;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_ZONELISTS; i++) {
> @@ -5099,15 +4983,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  
>  		prev_node = node;
>  		load--;
> -		if (order == ZONELIST_ORDER_NODE)
> -			build_zonelists_in_node_order(pgdat, node);
> -		else
> -			node_order[i++] = node;	/* remember order */
> -	}
> -
> -	if (order == ZONELIST_ORDER_ZONE) {
> -		/* calculate node order -- i.e., DMA last! */
> -		build_zonelists_in_zone_order(pgdat, i);
> +		build_zonelists_in_node_order(pgdat, node);
>  	}
>  
>  	build_thisnode_zonelists(pgdat);
> @@ -5135,11 +5011,6 @@ static void setup_min_unmapped_ratio(void);
>  static void setup_min_slab_ratio(void);
>  #else	/* CONFIG_NUMA */
>  
> -static void set_zonelist_order(void)
> -{
> -	current_zonelist_order = ZONELIST_ORDER_ZONE;
> -}
> -
>  static void build_zonelists(pg_data_t *pgdat)
>  {
>  	int node, local_node;
> @@ -5279,8 +5150,6 @@ build_all_zonelists_init(void)
>   */
>  void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
>  {
> -	set_zonelist_order();
> -
>  	if (system_state == SYSTEM_BOOTING) {
>  		build_all_zonelists_init();
>  	} else {
> @@ -5306,9 +5175,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
>  	else
>  		page_group_by_mobility_disabled = 0;
>  
> -	pr_info("Built %i zonelists in %s order, mobility grouping %s.  Total pages: %ld\n",
> +	pr_info("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
>  		nr_online_nodes,
> -		zonelist_order_name[current_zonelist_order],
>  		page_group_by_mobility_disabled ? "off" : "on",
>  		vm_total_pages);
>  #ifdef CONFIG_NUMA
> -- 
> 2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread [~2017-07-21 14:45 UTC]

Thread overview: 12+ messages
2017-07-14  7:59 [PATCH 0/9] cleanup zonelists initialization Michal Hocko
2017-07-14  7:59 ` [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE Michal Hocko
     [not found]   ` <20170714080006.7250-2-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-07-14  9:36     ` Mel Gorman
2017-07-14 10:47       ` Michal Hocko
     [not found]         ` <20170714104756.GD2618-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
2017-07-14 11:16           ` Mel Gorman
2017-07-14 11:38             ` Michal Hocko
2017-07-14 12:56               ` Mel Gorman
     [not found]                 ` <20170714125616.clbp4ezgtoon6cmk-l3A5Bk7waGM@public.gmane.org>
2017-07-14 13:01                   ` Mel Gorman
2017-07-14 13:08                 ` Michal Hocko
2017-07-19  9:33     ` Vlastimil Babka
     [not found]       ` <a4490c3e-9f7b-72b2-dfa3-80c054df6600-AlSwsSmVLrQ@public.gmane.org>
2017-07-19 13:44         ` Michal Hocko
     [not found] <20170721143915.14161-1-mhocko@kernel.org>
     [not found] ` <20170721143915.14161-2-mhocko@kernel.org>
2017-07-21 14:45   ` Michal Hocko
