* [PATCH 0/2] Disable zone_reclaim_mode by default
@ 2014-04-07 22:34 ` Mel Gorman
  0 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2014-04-07 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter,
	Linux-MM, LKML, Mel Gorman

When it was introduced, zone_reclaim_mode made sense as NUMA distances
were punishing and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common, but few workloads are NUMA-aware and
it is routine to see major performance degradation due to zone_reclaim_mode
being enabled while relatively few users can identify the problem.

Those that require zone_reclaim_mode are likely able to detect when it
needs to be enabled and tune appropriately, so let's have a sensible
default for the bulk of users.

 Documentation/sysctl/vm.txt | 17 +++++++++--------
 include/linux/mmzone.h      |  1 -
 mm/page_alloc.c             | 17 +----------------
 3 files changed, 10 insertions(+), 25 deletions(-)
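The runtime control whose default this series changes is an ordinary
sysctl. As a sketch (paths per Documentation/sysctl/vm.txt; the write is
shown commented out since it requires root and a partitioned workload to
be worthwhile):

```shell
# Show the current mode (0 = disabled, the default after this series).
cat /proc/sys/vm/zone_reclaim_mode 2>/dev/null \
    || echo "vm.zone_reclaim_mode not available on this system"

# Re-enable local reclaim of clean page cache (bit 1); requires root:
# sysctl -w vm.zone_reclaim_mode=1
```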

-- 
1.8.4.5



* [PATCH 1/2] mm: Disable zone_reclaim_mode by default
  2014-04-07 22:34 ` Mel Gorman
@ 2014-04-07 22:34   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2014-04-07 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter,
	Linux-MM, LKML, Mel Gorman

zone_reclaim_mode causes processes to prefer reclaiming memory from the
local node instead of spilling over to other nodes. This made sense
initially, when NUMA machines were almost exclusively used for HPC and the
workload was partitioned into nodes. The NUMA penalties were sufficiently
high to justify reclaiming the memory. On current machines and workloads
it is often the case that zone_reclaim_mode destroys performance, but not
all users know how to detect this. Favour the common case and disable it
by default. Users that are sophisticated enough to know they need
zone_reclaim_mode will detect it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/sysctl/vm.txt | 17 +++++++++--------
 mm/page_alloc.c             |  2 --
 2 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index d614a9b..ff5da70 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -751,16 +751,17 @@ This is value ORed together of
 2	= Zone reclaim writes dirty pages out
 4	= Zone reclaim swaps pages
 
-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
-
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
+zone_reclaim_mode is disabled by default.  For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
 data locality.
 
+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction.  The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
+
 Allowing zone reclaim to write out pages stops processes that are
 writing large amounts of data from dirtying pages on other nodes. Zone
 reclaim will write out dirty pages if a zone fills up and so effectively
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bac76a..a256f85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
 	for_each_online_node(i)
 		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
 			node_set(i, NODE_DATA(nid)->reclaim_nodes);
-		else
-			zone_reclaim_mode = 1;
 }
 
 #else	/* CONFIG_NUMA */
-- 
1.8.4.5
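Note that zone_reclaim_mode remains a bitmask of the three values listed
in the vm.txt hunk above, so behaviours combine by ORing. A hypothetical
setting that reclaims locally and also writes out dirty pages:

```shell
# 1 = zone reclaim on, 2 = write dirty pages out, 4 = swap pages
RECLAIM=1
WRITEBACK=2
mode=$(( RECLAIM | WRITEBACK ))
echo "$mode"    # prints 3
# echo "$mode" > /proc/sys/vm/zone_reclaim_mode   # requires root
```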



* [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances
  2014-04-07 22:34 ` Mel Gorman
@ 2014-04-07 22:34   ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2014-04-07 22:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter,
	Linux-MM, LKML, Mel Gorman

pgdat->reclaim_nodes tracks whether a remote node is allowed to be
reclaimed from by zone_reclaim, based on its distance. As zone_reclaim_mode
is expected to be rarely enabled, it is unreasonable for all machines to
pay a memory penalty for caching this information. Fortunately, the
zone_reclaim() path is already slow and it is the path that takes the hit.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  1 -
 mm/page_alloc.c        | 15 +--------------
 2 files changed, 1 insertion(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b61b9b..564b169 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -757,7 +757,6 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	nodemask_t reclaim_nodes;	/* Nodes allowed to reclaim from */
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a256f85..574928e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)
 
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
-	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
-}
-
-static void __paginginit init_zone_allows_reclaim(int nid)
-{
-	int i;
-
-	for_each_online_node(i)
-		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
-			node_set(i, NODE_DATA(nid)->reclaim_nodes);
+	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE;
 }
 
 #else	/* CONFIG_NUMA */
@@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 	return true;
 }
 
-static inline void init_zone_allows_reclaim(int nid)
-{
-}
 #endif	/* CONFIG_NUMA */
 
 /*
@@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
-	init_zone_allows_reclaim(nid);
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 #endif
-- 
1.8.4.5
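The replacement zone_allows_reclaim() check above is a plain distance
comparison. Its effect can be mimicked numerically; the SLIT distances
below are hypothetical (local node, one hop, distant node), and 30 is
assumed as the common RECLAIM_DISTANCE default, which architectures may
override:

```shell
RECLAIM_DISTANCE=30   # assumed common default; arch-specific in the kernel
# Hypothetical node distances: 10 = local, 20 = one hop, 40 = distant.
for dist in 10 20 40; do
    if [ "$dist" -lt "$RECLAIM_DISTANCE" ]; then
        echo "distance $dist: zone_reclaim allowed"
    else
        echo "distance $dist: fall back to off-node allocation"
    fi
done
```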



* Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
  2014-04-07 22:34   ` Mel Gorman
@ 2014-04-07 23:35     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2014-04-07 23:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund,
	Christoph Lameter, Linux-MM, LKML

On Mon, Apr 07, 2014 at 11:34:27PM +0100, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances
  2014-04-07 22:34   ` Mel Gorman
@ 2014-04-07 23:36     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2014-04-07 23:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund,
	Christoph Lameter, Linux-MM, LKML

On Mon, Apr 07, 2014 at 11:34:28PM +0100, Mel Gorman wrote:
> pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
> zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
> will be rarely enabled it is unreasonable for all machines to take a penalty.
> Fortunately, the zone_reclaim_mode() path is already slow and it is the path
> that takes the hit.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
  2014-04-07 22:34   ` Mel Gorman
@ 2014-04-08  1:17     ` Zhang Yanfei
  -1 siblings, 0 replies; 44+ messages in thread
From: Zhang Yanfei @ 2014-04-08  1:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund,
	Christoph Lameter, Linux-MM, LKML

On 04/08/2014 06:34 AM, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

> ---
>  Documentation/sysctl/vm.txt | 17 +++++++++--------
>  mm/page_alloc.c             |  2 --
>  2 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index d614a9b..ff5da70 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -751,16 +751,17 @@ This is value ORed together of
>  2	= Zone reclaim writes dirty pages out
>  4	= Zone reclaim swaps pages
>  
> -zone_reclaim_mode is set during bootup to 1 if it is determined that pages
> -from remote zones will cause a measurable performance reduction. The
> -page allocator will then reclaim easily reusable pages (those page
> -cache pages that are currently not used) before allocating off node pages.
> -
> -It may be beneficial to switch off zone reclaim if the system is
> -used for a file server and all of memory should be used for caching files
> -from disk. In that case the caching effect is more important than
> +zone_reclaim_mode is disabled by default.  For file servers or workloads
> +that benefit from having their data cached, zone_reclaim_mode should be
> +left disabled as the caching effect is likely to be more important than
>  data locality.
>  
> +zone_reclaim may be enabled if it's known that the workload is partitioned
> +such that each partition fits within a NUMA node and that accessing remote
> +memory would cause a measurable performance reduction.  The page allocator
> +will then reclaim easily reusable pages (those page cache pages that are
> +currently not used) before allocating off node pages.
> +
>  Allowing zone reclaim to write out pages stops processes that are
>  writing large amounts of data from dirtying pages on other nodes. Zone
>  reclaim will write out dirty pages if a zone fills up and so effectively
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3bac76a..a256f85 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
>  	for_each_online_node(i)
>  		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
>  			node_set(i, NODE_DATA(nid)->reclaim_nodes);
> -		else
> -			zone_reclaim_mode = 1;
>  }
>  
>  #else	/* CONFIG_NUMA */
> 


-- 
Thanks.
Zhang Yanfei



* Re: [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances
  2014-04-07 22:34   ` Mel Gorman
@ 2014-04-08  1:17     ` Zhang Yanfei
  -1 siblings, 0 replies; 44+ messages in thread
From: Zhang Yanfei @ 2014-04-08  1:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund,
	Christoph Lameter, Linux-MM, LKML

On 04/08/2014 06:34 AM, Mel Gorman wrote:
> pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
> zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
> will be rarely enabled it is unreasonable for all machines to take a penalty.
> Fortunately, the zone_reclaim_mode() path is already slow and it is the path
> that takes the hit.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>

> ---
>  include/linux/mmzone.h |  1 -
>  mm/page_alloc.c        | 15 +--------------
>  2 files changed, 1 insertion(+), 15 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9b61b9b..564b169 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -757,7 +757,6 @@ typedef struct pglist_data {
>  	unsigned long node_spanned_pages; /* total size of physical page
>  					     range, including holes */
>  	int node_id;
> -	nodemask_t reclaim_nodes;	/* Nodes allowed to reclaim from */
>  	wait_queue_head_t kswapd_wait;
>  	wait_queue_head_t pfmemalloc_wait;
>  	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a256f85..574928e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)
>  
>  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  {
> -	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
> -}
> -
> -static void __paginginit init_zone_allows_reclaim(int nid)
> -{
> -	int i;
> -
> -	for_each_online_node(i)
> -		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> -			node_set(i, NODE_DATA(nid)->reclaim_nodes);
> +	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE;
>  }
>  
>  #else	/* CONFIG_NUMA */
> @@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>  	return true;
>  }
>  
> -static inline void init_zone_allows_reclaim(int nid)
> -{
> -}
>  #endif	/* CONFIG_NUMA */
>  
>  /*
> @@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>  
>  	pgdat->node_id = nid;
>  	pgdat->node_start_pfn = node_start_pfn;
> -	init_zone_allows_reclaim(nid);
>  #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>  	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>  #endif
> 


-- 
Thanks.
Zhang Yanfei



* Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
  2014-04-07 22:34   ` Mel Gorman
@ 2014-04-08  7:14     ` Andres Freund
  -1 siblings, 0 replies; 44+ messages in thread
From: Andres Freund @ 2014-04-08  7:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Robert Haas, Josh Berkus, Christoph Lameter,
	Linux-MM, LKML

Hi,

On 2014-04-07 23:34:27 +0100, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.

Unsurprisingly I am in favor of this.

>  Documentation/sysctl/vm.txt | 17 +++++++++--------
>  mm/page_alloc.c             |  2 --
>  2 files changed, 9 insertions(+), 10 deletions(-)

But I think linux/topology.h's comment about RECLAIM_DISTANCE should be
adapted as well.

Thanks,

Andres

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-07 22:34 ` Mel Gorman
@ 2014-04-08  7:26   ` Vlastimil Babka
  -1 siblings, 0 replies; 44+ messages in thread
From: Vlastimil Babka @ 2014-04-08  7:26 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Robert Haas, Josh Berkus, Andres Freund, Christoph Lameter,
	Linux-MM, LKML

On 04/08/2014 12:34 AM, Mel Gorman wrote:
> When it was introduced, zone_reclaim_mode made sense as NUMA distances
> punished and workloads were generally partitioned to fit into a NUMA
> node. NUMA machines are now common but few of the workloads are NUMA-aware
> and it's routine to see major performance due to zone_reclaim_mode being
> disabled but relatively few can identify the problem.
     ^ I think you meant "enabled" here?

Just in case the cover letter goes to the changelog...

Vlastimil

> Those that require zone_reclaim_mode are likely to be able to detect when
> it needs to be enabled and tune appropriately so lets have a sensible
> default for the bulk of users.
>
>   Documentation/sysctl/vm.txt | 17 +++++++++--------
>   include/linux/mmzone.h      |  1 -
>   mm/page_alloc.c             | 17 +----------------
>   3 files changed, 10 insertions(+), 25 deletions(-)
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
  2014-04-07 22:34   ` Mel Gorman
@ 2014-04-08 14:14     ` Christoph Lameter
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Lameter @ 2014-04-08 14:14 UTC (permalink / raw)
  To: sivanich
  Cc: Mel Gorman, Andrew Morton, Robert Haas, Josh Berkus,
	Andres Freund, Linux-MM, LKML

On Mon, 7 Apr 2014, Mel Gorman wrote:

> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.

Ok that is going to require SGI machines to deal with zone_reclaim
configurations on bootup. Dimitri? Any comments?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-08  7:26   ` Vlastimil Babka
@ 2014-04-08 14:17     ` Christoph Lameter
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Lameter @ 2014-04-08 14:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Robert Haas, Josh Berkus,
	Andres Freund, Linux-MM, LKML, sivanich

On Tue, 8 Apr 2014, Vlastimil Babka wrote:

> On 04/08/2014 12:34 AM, Mel Gorman wrote:
> > When it was introduced, zone_reclaim_mode made sense as NUMA distances
> > punished and workloads were generally partitioned to fit into a NUMA
> > node. NUMA machines are now common but few of the workloads are NUMA-aware
> > and it's routine to see major performance due to zone_reclaim_mode being
> > disabled but relatively few can identify the problem.
>     ^ I think you meant "enabled" here?
>
> Just in case the cover letter goes to the changelog...

Correct.

Another solution here would be to increase the threshold so that
4 socket machines do not enable zone reclaim by default. The larger the
NUMA system is the more memory is off node from the perspective of a
processor and the larger the hit from remote memory.

On the other hand: The more expensive we make reclaim the less it
makes sense to allow zone reclaim to occur.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-08 14:17     ` Christoph Lameter
@ 2014-04-08 14:26       ` Andres Freund
  -1 siblings, 0 replies; 44+ messages in thread
From: Andres Freund @ 2014-04-08 14:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Robert Haas,
	Josh Berkus, Linux-MM, LKML, sivanich

On 2014-04-08 09:17:04 -0500, Christoph Lameter wrote:
> On Tue, 8 Apr 2014, Vlastimil Babka wrote:
> 
> > On 04/08/2014 12:34 AM, Mel Gorman wrote:
> > > When it was introduced, zone_reclaim_mode made sense as NUMA distances
> > > punished and workloads were generally partitioned to fit into a NUMA
> > > node. NUMA machines are now common but few of the workloads are NUMA-aware
> > > and it's routine to see major performance due to zone_reclaim_mode being
> > > disabled but relatively few can identify the problem.
> >     ^ I think you meant "enabled" here?
> >
> > Just in case the cover letter goes to the changelog...
> 
> Correct.
> 
> Another solution here would be to increase the threshold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.

FWIW, I've seen the problem hit hardest on 8 socket machines. Those are the
largest I have seen so far in postgres scenarios. Everything larger is
far less likely to be used as a single-node database server, so that's
possibly a sensible cutoff.
But then, I'd think that such many-socket machines are set up by
specialists, who'd know to enable it if it makes sense...

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
       [not found]     ` <WM!ea1193ee171854a74828ee30c859d97ff2ce66405ffa3a0b8c31a1233c6a0b55530cdf3cbfcd989c0ec18fef1d533f81!@asav-3.01.com>
@ 2014-04-08 14:46         ` Josh Berkus
  0 siblings, 0 replies; 44+ messages in thread
From: Josh Berkus @ 2014-04-08 14:46 UTC (permalink / raw)
  To: Christoph Lameter, Vlastimil Babka
  Cc: Mel Gorman, Andrew Morton, Robert Haas, Andres Freund, Linux-MM,
	LKML, sivanich

On 04/08/2014 10:17 AM, Christoph Lameter wrote:

> Another solution here would be to increase the threshold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.

8 and 16 socket machines aren't common for nonspecialist workloads
*now*, but by the time these changes make it into supported distribution
kernels, they may very well be.  So having zone_reclaim_mode
automatically turn itself on if you have more than 8 sockets would still
be a booby-trap ("Boss, I dunno.  I installed the additional processors
and memory performance went to hell!")

For zone_reclaim_mode=1 to be useful on standard servers, both of the
following need to be true:

1. the user has to have set CPU affinity for their applications;

2. the applications can't need more than one memory bank worth of cache.

The thing is, there is *no way* for Linux to know if the above is true.

Now, I can certainly imagine non-HPC workloads for which both of the
above would be true; for example, I've set up VMware ESX servers where
each VM has one socket and one memory bank. However, if the user knows
enough to set up socket affinity, they know enough to set
zone_reclaim_mode = 1.  The default should cover the know-nothing case,
not the experienced specialist case.
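Concretely, the opt-in for that specialist case is a one-line sysctl, e.g. as a persistent drop-in (illustrative fragment; the file name is an assumption, and the knob's semantics are described in Documentation/sysctl/vm.txt):

```
# /etc/sysctl.d/90-zone-reclaim.conf  (hypothetical file name)
# Opt back in on a machine where the workload is node-partitioned
# and CPU affinity has been set up deliberately.
vm.zone_reclaim_mode = 1
```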

I'd also argue that there's a fundamental false assumption in the entire
algorithm of zone_reclaim_mode, because there is no memory bank which is
as distant as disk is, ever.  However, if it's off by default, then I
don't care.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/2] mm: Disable zone_reclaim_mode by default
  2014-04-08 14:14     ` Christoph Lameter
@ 2014-04-08 14:47       ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2014-04-08 14:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: sivanich, Andrew Morton, Robert Haas, Josh Berkus, Andres Freund,
	Linux-MM, LKML

On Tue, Apr 08, 2014 at 09:14:05AM -0500, Christoph Lameter wrote:
> On Mon, 7 Apr 2014, Mel Gorman wrote:
> 
> > zone_reclaim_mode causes processes to prefer reclaiming memory from local
> > node instead of spilling over to other nodes. This made sense initially when
> > NUMA machines were almost exclusively HPC and the workload was partitioned
> > into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> > the memory. On current machines and workloads it is often the case that
> > zone_reclaim_mode destroys performance but not all users know how to detect
> > this. Favour the common case and disable it by default. Users that are
> > sophisticated enough to know they need zone_reclaim_mode will detect it.
> 
> Ok that is going to require SGI machines to deal with zone_reclaim
> configurations on bootup. Dimitri? Any comments?
> 

The SGI machines are also likely to be managed by system administrators
who are both aware of zone_reclaim_mode and know how to evaluate if it
should be enabled or not. The pair of patches is really aimed at the
common case of 2-8 socket machines running workloads that are not NUMA
aware.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-08 14:17     ` Christoph Lameter
@ 2014-04-08 19:53       ` Robert Haas
  -1 siblings, 0 replies; 44+ messages in thread
From: Robert Haas @ 2014-04-08 19:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Josh Berkus,
	Andres Freund, Linux-MM, LKML, sivanich

On Tue, Apr 8, 2014 at 10:17 AM, Christoph Lameter <cl@linux.com> wrote:
> Another solution here would be to increase the threshold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.

Well, as Josh quite rightly said, the hit from accessing remote memory
is never going to be as large as the hit from disk.  If and when there
is a machine where remote memory is more expensive to access than
disk, that's a good argument for zone_reclaim_mode.  But I don't
believe that's anywhere close to being true today, even on an 8-socket
machine with an SSD.

Now, perhaps the fear is that if we access that remote memory
*repeatedly* the aggregate cost will exceed what it would have cost to
fault that page into the local node just once.  But it takes a lot of
accesses for that to be true, and most of the time you won't get them.
 Even if you do, I bet many workloads will prefer even performance
across all the accesses over a very slow first access followed by
slightly faster subsequent accesses.

In an ideal world, the kernel would put the hottest pages on the local
node and the less-hot pages on remote nodes, moving pages around as
the workload shifts.  In practice, that's probably pretty hard.
Fortunately, it's not nearly as important as making sure we don't
unnecessarily hit the disk, which is infinitely slower than any memory
bank.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
       [not found]       ` <WM!55d2a092da9f6180473043487a4eb612ae8195f78d2ffdd83f673ed5cb2cb9659cf61e0c8d5bae23f5c914057bcd2ee4!@asav-3.01.com>
@ 2014-04-08 19:56           ` Josh Berkus
  0 siblings, 0 replies; 44+ messages in thread
From: Josh Berkus @ 2014-04-08 19:56 UTC (permalink / raw)
  To: Robert Haas, Christoph Lameter
  Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Andres Freund,
	Linux-MM, LKML, sivanich

On 04/08/2014 03:53 PM, Robert Haas wrote:
> In an ideal world, the kernel would put the hottest pages on the local
> node and the less-hot pages on remote nodes, moving pages around as
> the workload shifts.  In practice, that's probably pretty hard.
> Fortunately, it's not nearly as important as making sure we don't
> unnecessarily hit the disk, which is infinitely slower than any memory
> bank.

Even if the kernel could do this, we would *still* have to disable it
for PostgreSQL, since our double-buffering makes our pages look "cold"
to the kernel ... as discussed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-08 19:53       ` Robert Haas
@ 2014-04-08 22:58         ` Christoph Lameter
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Lameter @ 2014-04-08 22:58 UTC (permalink / raw)
  To: Robert Haas
  Cc: Vlastimil Babka, Mel Gorman, Andrew Morton, Josh Berkus,
	Andres Freund, Linux-MM, LKML, sivanich

On Tue, 8 Apr 2014, Robert Haas wrote:

> Well, as Josh quite rightly said, the hit from accessing remote memory
> is never going to be as large as the hit from disk.  If and when there
> is a machine where remote memory is more expensive to access than
> disk, that's a good argument for zone_reclaim_mode.  But I don't
> believe that's anywhere close to being true today, even on an 8-socket
> machine with an SSD.

I am not sure how disk figures into this?

The tradeoff is zone reclaim vs. the aggregate performance
degradation of the remote memory accesses. That depends on the
cacheability of the app and the scale of memory accesses.

The reason that zone reclaim is on by default is that off node accesses
are a big performance hit on large scale NUMA systems (like ScaleMP and
SGI). Zone reclaim was written *because* those systems experienced severe
performance degradation.

On the tightly coupled 4 and 8 node systems there does not seem to
be a benefit from what I hear.

> Now, perhaps the fear is that if we access that remote memory
> *repeatedly* the aggregate cost will exceed what it would have cost to
> fault that page into the local node just once.  But it takes a lot of
> accesses for that to be true, and most of the time you won't get them.
>  Even if you do, I bet many workloads will prefer even performance
> across all the accesses over a very slow first access followed by
> slightly faster subsequent accesses.

Many HPC workloads prefer the opposite.

> In an ideal world, the kernel would put the hottest pages on the local
> node and the less-hot pages on remote nodes, moving pages around as
> the workload shifts.  In practice, that's probably pretty hard.
> Fortunately, it's not nearly as important as making sure we don't
> unnecessarily hit the disk, which is infinitely slower than any memory
> bank.

Shifting pages involves similar tradeoffs as zone reclaim vs. remote
allocations.


* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-08 22:58         ` Christoph Lameter
@ 2014-04-08 23:26           ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2014-04-08 23:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Robert Haas, Vlastimil Babka, Andrew Morton, Josh Berkus,
	Andres Freund, Linux-MM, LKML, sivanich

On Tue, Apr 08, 2014 at 05:58:21PM -0500, Christoph Lameter wrote:
> On Tue, 8 Apr 2014, Robert Haas wrote:
> 
> > Well, as Josh quite rightly said, the hit from accessing remote memory
> > is never going to be as large as the hit from disk.  If and when there
> > is a machine where remote memory is more expensive to access than
> > disk, that's a good argument for zone_reclaim_mode.  But I don't
> > believe that's anywhere close to being true today, even on an 8-socket
> > machine with an SSD.
> 
> I am not sure how disk figures into this.
> 

It's a matter of perspective. Those running file servers, databases and
the like never see the remote accesses; what they see is their page cache
getting reclaimed, and because they are not NUMA-aware, not all of those
users understand why. This is why they perceive the cost of
zone_reclaim_mode as IO-related.

I think pretty much 100% of the bug reports I've seen related to
zone_reclaim_mode were due to IO-intensive workloads and the user not
recognising why page cache was getting reclaimed aggressively.
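For illustration, this is roughly how such a report is confirmed from
userspace (a diagnostic sketch; exact counters vary by kernel version, and
numastat comes from the numactl package):

```shell
# Is zone reclaim currently enabled? (non-zero means on)
cat /proc/sys/vm/zone_reclaim_mode

# Watch this while the IO-heavy workload runs; a steadily rising
# zone_reclaim_failed count suggests zone reclaim is churning page cache.
grep zone_reclaim_failed /proc/vmstat

# Per-node allocation statistics: a high numa_miss/numa_foreign count
# alongside plenty of free memory on remote nodes is the classic symptom.
numastat
```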

> The tradeoff is zone reclaim vs. the aggregate performance
> degradation of the remote memory accesses. That depends on the
> cacheability of the app and the scale of memory accesses.
> 

For HPC, yes.

> The reason that zone reclaim is on by default is that off node accesses
> are a big performance hit on large scale NUMA systems (like ScaleMP and
> SGI). Zone reclaim was written *because* those systems experienced severe
> performance degradation.
> 

Yes, this is understood. However, those same people already know how to use
cpusets and NUMA bindings, and how to tune their workload to partition it
into the nodes. From a NUMA perspective they are relatively sophisticated
and know how and when to set zone_reclaim_mode. At least in any bug report
I've seen related to these really large machines, they were already using
cpusets.
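For reference, the kind of explicit placement those users already do looks
something like this (a sketch; the job name is a placeholder, and numactl
ships with the numactl package):

```shell
# Re-enable zone reclaim explicitly on a machine where it is known to help
# (takes effect immediately; persist via /etc/sysctl.conf if desired).
sysctl -w vm.zone_reclaim_mode=1

# Pin a NUMA-partitioned job to node 0's CPUs and memory so its
# allocations never spill to remote nodes in the first place.
numactl --cpunodebind=0 --membind=0 ./hpc_job   # hpc_job is hypothetical
```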

This is why I think the default for zone_reclaim_mode should now be off:
it helps the common case.

> On the tightly coupled 4 and 8 node systems there does not seem to
> be a benefit from what I hear.
> 
> > Now, perhaps the fear is that if we access that remote memory
> > *repeatedly* the aggregate cost will exceed what it would have cost to
> > fault that page into the local node just once.  But it takes a lot of
> > accesses for that to be true, and most of the time you won't get them.
> >  Even if you do, I bet many workloads will prefer even performance
> > across all the accesses over a very slow first access followed by
> > slightly faster subsequent accesses.
> 
> Many HPC workloads prefer the opposite.
> 

And they know how to tune accordingly.

> > In an ideal world, the kernel would put the hottest pages on the local
> > node and the less-hot pages on remote nodes, moving pages around as
> > the workload shifts.  In practice, that's probably pretty hard.
> > Fortunately, it's not nearly as important as making sure we don't
> > unnecessarily hit the disk, which is infinitely slower than any memory
> > bank.
> 
> Shifting pages involves similar tradeoffs as zone reclaim vs. remote
> allocations.

In practice it really is hard for the kernel to do this
automatically. Automatic NUMA balancing will help if the data is mapped,
but not if it is accessed via buffered reads/writes, because there is no
hinting information available right now. At some point we may need to
tackle IO locality, but users should get experience with automatic
balancing as it is before we take further steps. That's an aside to the
current discussion.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-08 19:56           ` Josh Berkus
@ 2014-04-09 13:08             ` Mel Gorman
  -1 siblings, 0 replies; 44+ messages in thread
From: Mel Gorman @ 2014-04-09 13:08 UTC (permalink / raw)
  To: Josh Berkus
  Cc: Robert Haas, Christoph Lameter, Vlastimil Babka, Andrew Morton,
	Andres Freund, Linux-MM, LKML, sivanich

On Tue, Apr 08, 2014 at 03:56:49PM -0400, Josh Berkus wrote:
> On 04/08/2014 03:53 PM, Robert Haas wrote:
> > In an ideal world, the kernel would put the hottest pages on the local
> > node and the less-hot pages on remote nodes, moving pages around as
> > the workload shifts.  In practice, that's probably pretty hard.
> > Fortunately, it's not nearly as important as making sure we don't
> > unnecessarily hit the disk, which is infinitely slower than any memory
> > bank.
> 
> Even if the kernel could do this, we would *still* have to disable it
> for PostgreSQL, since our double-buffering makes our pages look "cold"
> to the kernel ... as discussed.
> 

If it's the shared mapping that is being used, then automatic NUMA
balancing should migrate those pages to a node local to the CPU accessing
them, but how well that works will partly depend on how much those
accesses move around. It's independent of the zone_reclaim_mode issue.
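Whether automatic NUMA balancing is active can be checked at runtime (a
sketch; the sysctl exists on kernels built with CONFIG_NUMA_BALANCING,
v3.13 and later):

```shell
# 1 = automatic NUMA balancing enabled, 0 = disabled
sysctl kernel.numa_balancing

# Hinting-fault activity appears in /proc/vmstat when balancing is working
grep numa_hint_faults /proc/vmstat
```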

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-08 22:58         ` Christoph Lameter
@ 2014-04-10 10:26           ` Jeremy Harris
  -1 siblings, 0 replies; 44+ messages in thread
From: Jeremy Harris @ 2014-04-10 10:26 UTC (permalink / raw)
  Cc: Linux-MM, LKML

On 08/04/14 23:58, Christoph Lameter wrote:
> The reason that zone reclaim is on by default is that off node accesses
> are a big performance hit on large scale NUMA systems (like ScaleMP and
> SGI). Zone reclaim was written *because* those systems experienced severe
> performance degradation.
>
> On the tightly coupled 4 and 8 node systems there does not seem to
> be a benefit from what I hear.

In principle, is this difference in distance something the kernel
could measure?
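The kernel already has the raw data: the firmware-provided SLIT distances
are what the current auto-enable heuristic compares against
RECLAIM_DISTANCE. They can be inspected from userspace (a sketch; the
sysfs path assumes a NUMA-capable Linux kernel):

```shell
# The kernel's view of inter-node distances (the SLIT), as a matrix
numactl --hardware

# The same data per node, straight from sysfs; 10 means "local",
# larger values mean proportionally more expensive access.
cat /sys/devices/system/node/node0/distance
```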
-- 
Cheers,
    Jeremy



* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-07 22:34 ` Mel Gorman
@ 2014-04-18 15:49   ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2014-04-18 15:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Robert Haas, Josh Berkus, Andres Freund,
	Christoph Lameter, Linux-MM, LKML

On Mon 07-04-14 23:34:26, Mel Gorman wrote:
> When it was introduced, zone_reclaim_mode made sense: NUMA distances were
> punishing and workloads were generally partitioned to fit into a NUMA
> node. NUMA machines are now common but few workloads are NUMA-aware, and
> it's routine to see major performance problems due to zone_reclaim_mode
> being enabled, while relatively few users can identify the cause.
> 
> Those that require zone_reclaim_mode are likely to be able to detect when
> it needs to be enabled and tune appropriately, so let's have a sensible
> default for the bulk of users.
> 
>  Documentation/sysctl/vm.txt | 17 +++++++++--------
>  include/linux/mmzone.h      |  1 -
>  mm/page_alloc.c             | 17 +----------------
>  3 files changed, 10 insertions(+), 25 deletions(-)

Auto-enabling caused so many reports in the past that it is definitely
much better to not be clever and let admins enable zone_reclaim where it
is appropriate instead.

For both patches.
Acked-by: Michal Hocko <mhocko@suse.cz>
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH 0/2] Disable zone_reclaim_mode by default
  2014-04-18 15:49   ` Michal Hocko
@ 2014-04-18 16:44     ` Christoph Lameter
  -1 siblings, 0 replies; 44+ messages in thread
From: Christoph Lameter @ 2014-04-18 16:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Andrew Morton, Robert Haas, Josh Berkus,
	Andres Freund, Linux-MM, LKML

On Fri, 18 Apr 2014, Michal Hocko wrote:

> Auto-enabling caused so many reports in the past that it is definitely
> much better to not be clever and let admins enable zone_reclaim where it
> is appropriate instead.
>
> For both patches.
> Acked-by: Michal Hocko <mhocko@suse.cz>

I did not get any objections from SGI either.

Reviewed-by: Christoph Lameter <cl@linux.com>


end of thread, other threads:[~2014-04-18 16:44 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-07 22:34 [PATCH 0/2] Disable zone_reclaim_mode by default Mel Gorman
2014-04-07 22:34 ` [PATCH 1/2] mm: " Mel Gorman
2014-04-07 23:35   ` Johannes Weiner
2014-04-08  1:17   ` Zhang Yanfei
2014-04-08  7:14   ` Andres Freund
2014-04-08 14:14   ` Christoph Lameter
2014-04-08 14:47     ` Mel Gorman
2014-04-07 22:34 ` [PATCH 2/2] mm: page_alloc: Do not cache reclaim distances Mel Gorman
2014-04-07 23:36   ` Johannes Weiner
2014-04-08  1:17   ` Zhang Yanfei
2014-04-08  7:26 ` [PATCH 0/2] Disable zone_reclaim_mode by default Vlastimil Babka
2014-04-08 14:17   ` Christoph Lameter
2014-04-08 14:26     ` Andres Freund
2014-04-08 14:46     ` Josh Berkus
2014-04-08 19:53     ` Robert Haas
2014-04-08 19:56       ` Josh Berkus
2014-04-09 13:08         ` Mel Gorman
2014-04-08 22:58       ` Christoph Lameter
2014-04-08 23:26         ` Mel Gorman
2014-04-10 10:26         ` Jeremy Harris
2014-04-18 15:49 ` Michal Hocko
2014-04-18 16:44   ` Christoph Lameter
