* [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations
@ 2014-04-18 14:50 Mel Gorman
  2014-04-18 14:50 ` [PATCH 01/16] mm: Disable zone_reclaim_mode by default Mel Gorman
                   ` (15 more replies)
  0 siblings, 16 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

I was investigating a performance bug that looked like dd to tmpfs
had regressed.  The bulk of the problem turned out to be a difference
in Kconfig but it got me looking at the unnecessary overhead in tmpfs,
mark_page_accessed and parts of the allocator. This series is the result.

The primary test workload was dd to a tmpfs file that was 1/10th the size
of memory so that dirty balancing and reclaim should not be factors.
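
For anyone who wants to approximate the workload, something like the sketch
below is close enough. It is only an illustration, not the harness used for
the figures that follow: the /tmp path (assumed to be tmpfs-backed), the 1MB
block size and the file name are arbitrary choices.

/* Write a file sized at roughly 1/10th of RAM and report the throughput. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	long long target = (long long)sysconf(_SC_PHYS_PAGES) *
			   sysconf(_SC_PAGE_SIZE) / 10;
	const size_t blksz = 1 << 20;		/* 1MB writes */
	char *buf = calloc(1, blksz);
	int fd = open("/tmp/loopdd.dat", O_CREAT | O_TRUNC | O_WRONLY, 0600);
	struct timespec start, end;
	long long written = 0;

	if (!buf || fd < 0) {
		perror("setup");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (written < target) {
		ssize_t ret = write(fd, buf, blksz);
		if (ret <= 0) {
			perror("write");
			return 1;
		}
		written += ret;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	double secs = (end.tv_sec - start.tv_sec) +
		      (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("wrote %lldMB in %.3fs (%.1f MB/sec)\n",
	       written >> 20, secs, (double)(written >> 20) / secs);
	close(fd);
	unlink("/tmp/loopdd.dat");
	return 0;
}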

loopdd Throughput
                     3.15.0-rc1            3.15.0-rc1
                        vanilla        microopt-v1r11
Min      3993.6000 (  0.00%)      4096.0000 (  2.56%)
Mean     4766.7200 (  0.00%)      4896.4267 (  2.72%)
Stddev    164.5053 (  0.00%)       167.7316 (  1.96%)
Max      4812.8000 (  0.00%)      5120.0000 (  6.38%)

Respectable increase in throughput. The figures are misleading though because
dd reports in GB/sec so there is a lot of noise. The actual time to completion
is easier to see

loopdd Time
                         3.15.0-rc1            3.15.0-rc1
                            vanilla        microopt-v1r11
Min      time      0.3521 (  0.00%)      0.3317 (  5.80%)
Mean     time      0.3570 (  0.00%)      0.3458 (  3.14%)
Stddev   time      0.0140 (  0.00%)      0.0112 ( 20.59%)
Max      time      0.4230 (  0.00%)      0.4083 (  3.49%)

The time to dd the data is noticeably reduced.

          3.15.0-rc1  3.15.0-rc1
                vanilla  microopt-v1r11
User           10.86       10.78
System         70.21       67.12
Elapsed        92.43       89.42

And the system CPU overhead is lower.

Tests against various filesystems as well as a general benchmark are still
running, but I thought I would send the series out as-is for comment.

 Documentation/sysctl/vm.txt         |  17 ++--
 arch/ia64/include/asm/topology.h    |   3 +-
 arch/powerpc/include/asm/topology.h |   8 +-
 include/linux/cpuset.h              |  29 +++++++
 include/linux/mmzone.h              |  14 ++-
 include/linux/page-flags.h          |   2 +
 include/linux/pageblock-flags.h     |  18 +++-
 include/linux/swap.h                |   7 +-
 include/linux/topology.h            |   3 +-
 kernel/cpuset.c                     |   8 +-
 mm/filemap.c                        |  58 ++++++++-----
 mm/page_alloc.c                     | 164 ++++++++++++++++++++----------------
 mm/shmem.c                          |   8 +-
 mm/swap.c                           |  13 ++-
 14 files changed, 226 insertions(+), 126 deletions(-)

-- 
1.8.4.5


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [PATCH 01/16] mm: Disable zone_reclaim_mode by default
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50 ` Mel Gorman
  2014-04-18 17:26     ` Andi Kleen
  2014-04-18 14:50   ` Mel Gorman
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

zone_reclaim_mode causes processes to prefer reclaiming memory from the local
node instead of spilling over to other nodes. This made sense initially when
NUMA machines were almost exclusively HPC and the workload was partitioned
into nodes. The NUMA penalties were sufficiently high to justify reclaiming
the memory. On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance, but not all users know how to detect
this. Favour the common case and disable it by default. Users that are
sophisticated enough to know they need zone_reclaim_mode can still enable it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 Documentation/sysctl/vm.txt         | 17 +++++++++--------
 arch/ia64/include/asm/topology.h    |  3 ++-
 arch/powerpc/include/asm/topology.h |  8 ++------
 include/linux/topology.h            |  3 ++-
 mm/page_alloc.c                     |  2 --
 5 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index dd9d0e3..5b6da0f 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -772,16 +772,17 @@ This is value ORed together of
 2	= Zone reclaim writes dirty pages out
 4	= Zone reclaim swaps pages
 
-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
-
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
+zone_reclaim_mode is disabled by default.  For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
 data locality.
 
+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction.  The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
+
 Allowing zone reclaim to write out pages stops processes that are
 writing large amounts of data from dirtying pages on other nodes. Zone
 reclaim will write out dirty pages if a zone fills up and so effectively
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index 5cb55a1..3555fdd 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -21,7 +21,8 @@
 #define PENALTY_FOR_NODE_WITH_CPUS 255
 
 /*
- * Distance above which we begin to use zone reclaim
+ * Nodes within this distance are eligible for reclaim by zone_reclaim() when
+ * zone_reclaim_mode is enabled.
  */
 #define RECLAIM_DISTANCE 15
 
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index c920215..6c8a8c5 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -9,12 +9,8 @@ struct device_node;
 #ifdef CONFIG_NUMA
 
 /*
- * Before going off node we want the VM to try and reclaim from the local
- * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
- * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
- * 20, we never reclaim and go off node straight away.
- *
- * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that
+ * all zones on all nodes will be eligible for zone_reclaim().
  */
 #define RECLAIM_DISTANCE 10
 
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7062330..53261e2 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -58,7 +58,8 @@ int arch_update_cpu_topology(void);
 /*
  * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
  * (in whatever arch specific measurement units returned by node_distance())
- * then switch on zone reclaim on boot.
+ * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * on nodes within this distance.
  */
 #define RECLAIM_DISTANCE 30
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba293..628f1e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1860,8 +1860,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
 	for_each_node_state(i, N_MEMORY)
 		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
 			node_set(i, NODE_DATA(nid)->reclaim_nodes);
-		else
-			zone_reclaim_mode = 1;
 }
 
 #else	/* CONFIG_NUMA */
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 02/16] mm: page_alloc: Do not cache reclaim distances
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

pgdat->reclaim_nodes tracks whether a remote node is allowed to be reclaimed
by zone_reclaim based on its distance. As zone_reclaim_mode is expected to be
rarely enabled, it is unreasonable for all machines to pay the cost of
maintaining this cache. Remove it and compute the distance check on demand.
Fortunately, the zone_reclaim path is already slow and it is the path that
takes the hit.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 include/linux/mmzone.h |  1 -
 mm/page_alloc.c        | 18 ++----------------
 2 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fac5509..c1dbe0b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -763,7 +763,6 @@ typedef struct pglist_data {
 	unsigned long node_spanned_pages; /* total size of physical page
 					     range, including holes */
 	int node_id;
-	nodemask_t reclaim_nodes;	/* Nodes allowed to reclaim from */
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
 	struct task_struct *kswapd;	/* Protected by lock_memory_hotplug() */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 628f1e7..3c8200c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1850,16 +1850,8 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)
 
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
-	return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
-}
-
-static void __paginginit init_zone_allows_reclaim(int nid)
-{
-	int i;
-
-	for_each_node_state(i, N_MEMORY)
-		if (node_distance(nid, i) <= RECLAIM_DISTANCE)
-			node_set(i, NODE_DATA(nid)->reclaim_nodes);
+	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
+							RECLAIM_DISTANCE;
 }
 
 #else	/* CONFIG_NUMA */
@@ -1892,10 +1884,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return true;
 }
-
-static inline void init_zone_allows_reclaim(int nid)
-{
-}
 #endif	/* CONFIG_NUMA */
 
 /*
@@ -4919,8 +4907,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
 
 	pgdat->node_id = nid;
 	pgdat->node_start_pfn = node_start_pfn;
-	if (node_state(nid, N_MEMORY))
-		init_zone_allows_reclaim(nid);
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
 #endif
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 03/16] mm: page_alloc: Do not update zlc unless the zlc is active
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

The zlc (zonelist cache) is used on NUMA machines to quickly skip over zones
that are full. However, it is always updated, even for the first zone scanned
when the zlc might not even be active. As it is a write to a bitmap that
potentially bounces a cache line, it is deceptively expensive and most
machines will not care. Only update the zlc if it was active.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c8200c5..d8c9c4a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2030,7 +2030,7 @@ try_this_zone:
 		if (page)
 			break;
 this_zone_full:
-		if (IS_ENABLED(CONFIG_NUMA))
+		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 04/16] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full"
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

If a zone cannot be used for a dirty page then it gets marked "full", which
is cached in the zlc, and the zone may later be skipped by allocation
requests that have nothing to do with dirty page placement.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8c9c4a..ad702e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1962,7 +1962,7 @@ zonelist_scan:
 		 */
 		if ((alloc_flags & ALLOC_WMARK_LOW) &&
 		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
-			goto this_zone_full;
+			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 05/16] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
                   ` (3 preceding siblings ...)
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 14:50 ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

If cpusets are not in use then we still check a global variable on every
page allocation. Use jump labels to avoid the overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/cpuset.h | 29 +++++++++++++++++++++++++++++
 kernel/cpuset.c        |  8 ++++++--
 mm/page_alloc.c        |  3 ++-
 3 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index b19d3dc..9c840e3 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -17,6 +17,35 @@
 
 extern int number_of_cpusets;	/* How many cpusets are defined in system? */
 
+#ifdef HAVE_JUMP_LABEL
+extern struct static_key cpusets_enabled_key;
+static inline bool cpusets_enabled(void)
+{
+	return static_key_false(&cpusets_enabled_key);
+}
+#else
+static inline bool cpusets_enabled(void)
+{
+	return number_of_cpusets > 1;
+}
+#endif
+
+static inline void cpuset_inc(void)
+{
+	number_of_cpusets++;
+#ifdef HAVE_JUMP_LABEL
+	static_key_slow_inc(&cpusets_enabled_key);
+#endif
+}
+
+static inline void cpuset_dec(void)
+{
+	number_of_cpusets--;
+#ifdef HAVE_JUMP_LABEL
+	static_key_slow_dec(&cpusets_enabled_key);
+#endif
+}
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_update_active_cpus(bool cpu_online);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3d54c41..34ada52 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -68,6 +68,10 @@
  */
 int number_of_cpusets __read_mostly;
 
+#ifdef HAVE_JUMP_LABEL
+struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;
+#endif
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -1888,7 +1892,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
 
-	number_of_cpusets++;
+	cpuset_inc();
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -1939,7 +1943,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
-	number_of_cpusets--;
+	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
 	mutex_unlock(&cpuset_mutex);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ad702e9..3f2a9dd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1916,7 +1916,8 @@ zonelist_scan:
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
+		if (cpusets_enabled() &&
+			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

There is no need to calculate zone_idx(preferred_zone) multiple times
or use the pgdat to figure it out.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f2a9dd..88a6dac 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1893,17 +1893,15 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int classzone_idx, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
-	int classzone_idx;
 	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	classzone_idx = zone_idx(preferred_zone);
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2160,7 +2158,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
@@ -2178,7 +2176,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto out;
 
@@ -2213,7 +2211,7 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int classzone_idx, int migratetype, bool sync_migration,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2241,7 +2239,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			compaction_defer_reset(preferred_zone, order, true);
@@ -2314,7 +2312,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress)
+	int classzone_idx, int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	bool drained = false;
@@ -2332,7 +2330,8 @@ retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
-					preferred_zone, migratetype);
+					preferred_zone, classzone_idx,
+					migratetype);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2355,14 +2354,14 @@ static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2463,7 +2462,7 @@ static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -2520,7 +2519,7 @@ rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto got_pg;
 
@@ -2535,7 +2534,7 @@ rebalance:
 
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			goto got_pg;
 		}
@@ -2568,6 +2567,7 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
+					classzone_idx,
 					migratetype, sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
@@ -2591,7 +2591,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress);
+					classzone_idx, migratetype,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -2610,7 +2611,7 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					classzone_idx, migratetype);
 			if (page)
 				goto got_pg;
 
@@ -2653,6 +2654,7 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
+					classzone_idx,
 					migratetype, sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
@@ -2680,11 +2682,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
+	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	struct mem_cgroup *memcg = NULL;
+	int classzone_idx;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2714,11 +2718,12 @@ retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx,
+	preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
 				nodemask ? : &cpuset_current_mems_allowed,
 				&preferred_zone);
 	if (!preferred_zone)
 		goto out;
+	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -2728,7 +2733,7 @@ retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
 		 * The first pass makes sure allocations are spread
@@ -2754,7 +2759,7 @@ retry:
 		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 07/16] mm: page_alloc: Only check the zone id if pages are buddies
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

A node/zone index is used to check if pages are compatible for merging
but this happens unconditionally even if the buddy page is not free. Defer
the calculation as long as possible. Ideally we would check the zone boundary
but nodes can overlap.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88a6dac..c5933a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -508,16 +508,26 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	if (!pfn_valid_within(page_to_pfn(buddy)))
 		return 0;
 
-	if (page_zone_id(page) != page_zone_id(buddy))
-		return 0;
-
 	if (page_is_guard(buddy) && page_order(buddy) == order) {
 		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);
+
+		if (page_zone_id(page) != page_zone_id(buddy))
+			return 0;
+
 		return 1;
 	}
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
 		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);
+
+		/*
+		 * zone check is done late to avoid uselessly
+		 * calculating zone/node ids for pages that could
+		 * never merge.
+		 */
+		if (page_zone_id(page) != page_zone_id(buddy))
+			return 0;
+
 		return 1;
 	}
 	return 0;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

The check of the alloc flags and gfp_mask that decides whether a zone's dirty
limits apply is currently repeated for every zone in the zonelist even though
the answer does not change during the scan. Calculate it once per allocation
attempt instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5933a5..770735a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1911,6 +1911,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
+				(gfp_mask & __GFP_WRITE);
 
 zonelist_scan:
 	/*
@@ -1969,8 +1971,7 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+		if (consider_zone_dirty && !zone_dirty_ok(zone))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 09/16] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

ALLOC_NO_WATERMARKS is set in a few cases: always by kswapd, always for
__GFP_MEMALLOC, sometimes for swap-over-nfs, some tasks etc. Each of these
cases is relatively rare but the ALLOC_NO_WATERMARKS check is an unlikely
branch in the fast path.  This patch moves the check out of the fast path
and to after it has been determined that the watermarks have not been met.
This helps the common fast path at the cost of making the slow path slower
and hitting kswapd with a performance cost. It's a reasonable tradeoff.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 770735a..737577c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1930,9 +1930,6 @@ zonelist_scan:
 			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
-		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
-		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
-			goto try_this_zone;
 		/*
 		 * Distribute pages in proportion to the individual
 		 * zone size to ensure fair page aging.  The zone a
@@ -1979,6 +1976,11 @@ zonelist_scan:
 				       classzone_idx, alloc_flags)) {
 			int ret;
 
+			/* Checked here to keep the fast path fast */
+			BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
+			if (alloc_flags & ALLOC_NO_WATERMARKS)
+				goto try_this_zone;
+
 			if (IS_ENABLED(CONFIG_NUMA) &&
 					!did_zlc_setup && nr_online_nodes > 1) {
 				/*
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 10/16] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

The test_bit operations in the get/set pageblock flags functions are
expensive. This patch reads the bitmap on a word basis and uses shifts and
masks to isolate the bits of interest. Similarly, masks are used to prepare a
local copy of the word and cmpxchg is used to update the bitmap only if no
other changes have been made in parallel.

In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.
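
The shift/mask/cmpxchg pattern is easier to see outside the kernel. Below is
a stand-alone user-space sketch of the same idea: several small fields packed
into one word, a read via a single shift and mask, and an update via a
compare-and-swap retry loop. It does not reproduce the kernel's exact bit
ordering, and the GCC __atomic builtins stand in for cmpxchg().

#include <stdio.h>

#define NR_FIELD_BITS	3
#define FIELD_MASK	((1UL << NR_FIELD_BITS) - 1)

/* Stand-in for one word of the pageblock bitmap. */
static unsigned long bitmap_word;

/* Read the 3-bit field starting at bit offset 'bitidx' with one shift+mask. */
static unsigned long get_field(unsigned long bitidx)
{
	return (__atomic_load_n(&bitmap_word, __ATOMIC_RELAXED) >> bitidx) &
		FIELD_MASK;
}

/* Replace the field, retrying if another writer changed the word in parallel. */
static void set_field(unsigned long bitidx, unsigned long val)
{
	unsigned long mask = FIELD_MASK << bitidx;
	unsigned long old_word, new_word;

	do {
		old_word = __atomic_load_n(&bitmap_word, __ATOMIC_RELAXED);
		new_word = (old_word & ~mask) | ((val & FIELD_MASK) << bitidx);
	} while (!__atomic_compare_exchange_n(&bitmap_word, &old_word, new_word,
					      0, __ATOMIC_RELAXED,
					      __ATOMIC_RELAXED));
}

int main(void)
{
	set_field(0, 5);	/* e.g. a migratetype-style value    */
	set_field(3, 1);	/* e.g. a skip-style flag next to it */
	printf("field@0=%lu field@3=%lu word=%#lx\n",
	       get_field(0), get_field(3), bitmap_word);
	return 0;
}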

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h          |  6 +++++-
 include/linux/pageblock-flags.h | 21 ++++++++++++++++----
 mm/page_alloc.c                 | 43 +++++++++++++++++++++++++----------------
 3 files changed, 48 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c1dbe0b..c97b4bc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -75,9 +75,13 @@ enum {
 
 extern int page_group_by_mobility_disabled;
 
+#define NR_MIGRATETYPE_BITS 3
+#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
+
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
+	return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
 }
 
 struct free_area {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 2ee8cd2..c89ac75 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,9 +30,12 @@ enum pageblock_bits {
 	PB_migrate,
 	PB_migrate_end = PB_migrate + 3 - 1,
 			/* 3 bits required for migrate types */
-#ifdef CONFIG_COMPACTION
 	PB_migrate_skip,/* If set the block is skipped by compaction */
-#endif /* CONFIG_COMPACTION */
+
+	/*
+	 * Assume the bits will always align on a word. If this assumption
+	 * changes then get/set pageblock needs updating.
+	 */
 	NR_PAGEBLOCK_BITS
 };
 
@@ -62,9 +65,19 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
+unsigned long get_pageblock_flags_mask(struct page *page,
+				unsigned long nr_flag_bits,
+				unsigned long mask);
+
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx);
+static inline unsigned long get_pageblock_flags_group(struct page *page,
+					int start_bitidx, int end_bitidx)
+{
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+
+	return get_pageblock_flags_mask(page, nr_flag_bits, mask);
+}
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
 					int start_bitidx, int end_bitidx);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 737577c..6047866 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6012,25 +6012,24 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
+unsigned long get_pageblock_flags_mask(struct page *page,
+					unsigned long nr_flag_bits,
+					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long flags = 0;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long word;
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (test_bit(bitidx + start_bitidx, bitmap))
-			flags |= value;
-
-	return flags;
+	word = bitmap[word_bitidx];
+	return (word >> (BITS_PER_LONG - (bitidx + nr_flag_bits))) & mask;
 }
 
 /**
@@ -6045,20 +6044,30 @@ void set_pageblock_flags_group(struct page *page, unsigned long flags,
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+	unsigned long old_word, new_word;
+
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
+
 	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (flags & value)
-			__set_bit(bitidx + start_bitidx, bitmap);
-		else
-			__clear_bit(bitidx + start_bitidx, bitmap);
+	end_bitidx = bitidx + (end_bitidx - start_bitidx);
+	mask <<= (BITS_PER_LONG - end_bitidx - 1);
+	flags <<= (BITS_PER_LONG - end_bitidx - 1);
+
+	do {
+		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
+		new_word = (old_word & ~mask) | flags;
+	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);
 }
 
 /*
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 11/16] mm: page_alloc: Reduce number of times page_to_pfn is called
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

In the free path we calculate page_to_pfn multiple times. Reduce that.
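
The shape of the change, as a rough sketch (simplified from the diff below
rather than a literal excerpt; the function name is made up): the pfn is
derived once at the entry point and handed down instead of every callee
recomputing it with page_to_pfn().

	/* Sketch only: roughly what __free_pages_ok() looks like afterwards */
	static void free_path_sketch(struct page *page, unsigned int order)
	{
		unsigned long pfn = page_to_pfn(page);	/* computed once */
		int migratetype = get_pfnblock_migratetype(page, pfn);

		free_one_page(page_zone(page), page, pfn, order, migratetype);
	}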

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h          |  9 +++++++--
 include/linux/pageblock-flags.h | 15 ++++++---------
 mm/page_alloc.c                 | 26 +++++++++++++++-----------
 3 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c97b4bc..14ed8d1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -78,10 +78,15 @@ extern int page_group_by_mobility_disabled;
 #define NR_MIGRATETYPE_BITS 3
 #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
 
-static inline int get_pageblock_migratetype(struct page *page)
+#define get_pageblock_migratetype(page)					\
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+				NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK)
+
+static inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn)
 {
 	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
-	return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
+	return get_pfnblock_flags_mask(page, pfn,
+					NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
 }
 
 struct free_area {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index c89ac75..6a9dd5b 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -65,19 +65,16 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page,
+				unsigned long pfn,
 				unsigned long nr_flag_bits,
 				unsigned long mask);
 
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-static inline unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
-{
-	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
-	unsigned long mask = (1 << nr_flag_bits) - 1;
-
-	return get_pageblock_flags_mask(page, nr_flag_bits, mask);
-}
+#define get_pageblock_flags_group(page, start_bitidx, end_bitidx) \
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+			end_bitidx - start_bitidx + 1,			\
+			(1 << (end_bitidx - start_bitidx + 1)) - 1)
 void set_pageblock_flags_group(struct page *page, unsigned long flags,
 					int start_bitidx, int end_bitidx);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6047866..377e58a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -559,6 +559,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
 		struct zone *zone, unsigned int order,
 		int migratetype)
 {
@@ -575,7 +576,7 @@ static inline void __free_one_page(struct page *page,
 
 	VM_BUG_ON(migratetype == -1);
 
-	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+	page_idx = pfn & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -710,7 +711,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			list_del(&page->lru);
 			mt = get_freepage_migratetype(page);
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
-			__free_one_page(page, zone, 0, mt);
+			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
 			if (likely(!is_migrate_isolate_page(page))) {
 				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
@@ -722,13 +723,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order,
+static void free_one_page(struct zone *zone,
+				struct page *page, unsigned long pfn,
+				int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone->pages_scanned = 0;
 
-	__free_one_page(page, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype);
 	if (unlikely(!is_migrate_isolate(migratetype)))
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock(&zone->lock);
@@ -765,15 +768,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	unsigned long pfn = page_to_pfn(page);
 
 	if (!free_pages_prepare(page, order))
 		return;
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
-	free_one_page(page_zone(page), page, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
 }
 
@@ -1376,12 +1380,13 @@ void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
 	if (!free_pages_prepare(page, 0))
 		return;
 
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
@@ -1395,7 +1400,7 @@ void free_hot_cold_page(struct page *page, int cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
@@ -6012,17 +6017,16 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
 					unsigned long nr_flag_bits,
 					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long bitidx, word_bitidx;
 	unsigned long word;
 
 	zone = page_zone(page);
-	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 12/16] mm: shmem: Avoid atomic operation during shmem_getpage_gfp
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

shmem_getpage_gfp uses an atomic operation to set the SwapBacked flag on a
newly allocated page before it is even added to the LRU or visible to
anyone else. This is unnecessary as there is nothing it could possibly race
against at that point. Use an unlocked variant.
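
For context, the two variants boil down to roughly the following (a
simplified sketch of what the PAGEFLAG()/__SETPAGEFLAG() macros generate,
not the literal expansion):

	static inline void SetPageSwapBacked(struct page *page)
	{
		set_bit(PG_swapbacked, &page->flags);	/* atomic (locked) RMW */
	}

	static inline void __SetPageSwapBacked(struct page *page)
	{
		/* plain RMW: only safe while no one else can see the page */
		__set_bit(PG_swapbacked, &page->flags);
	}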

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h | 1 +
 mm/shmem.c                 | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d1fe1a7..4d4b39a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -208,6 +208,7 @@ PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
 PAGEFLAG(SavePinned, savepinned);			/* Xen */
 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
+	__SETPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f70e02..f47fb38 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1132,7 +1132,7 @@ repeat:
 			goto decused;
 		}
 
-		SetPageSwapBacked(page);
+		__SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_charge_file(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 13/16] mm: Do not use atomic operations when releasing pages
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

By this point there should be no references to the page any more and a
parallel mark_page_accessed should not be reordered against us. Use the
non-locked variant to clear the page active flag.
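
For reference, a trimmed-down sketch of the surrounding context in
release_pages() (not a literal excerpt) showing why the plain update is
safe at this point:

	if (!put_page_testzero(page))
		continue;	/* someone else still holds a reference */

	/*
	 * The last reference is gone and the page has been taken off the
	 * LRU, so no one can run mark_page_accessed() against it any more
	 * and a plain, non-atomic flag update cannot lose a concurrent
	 * change to page->flags.
	 */
	__ClearPageActive(page);
	list_add(&page->lru, &pages_to_free);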

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/swap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 9ce43ba..fed4caf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, int cold)
 		}
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
-		ClearPageActive(page);
+		__ClearPageActive(page);
 
 		list_add(&page->lru, &pages_to_free);
 	}
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 14/16] mm: Do not use unnecessary atomic operations when adding pages to the LRU
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

When adding pages to the LRU we clear the active bit unconditionally. As the
page could be reachable from other paths we cannot use unlocked operations
without risking corruption from, for example, a parallel mark_page_accessed.
This patch tests whether it is necessary to clear the active flag before
using an atomic operation. In the unlikely event this races with
mark_page_accessed the consequence is simply that the page may be promoted
to the active list when it might have been left on the inactive list before
the patch. This is a marginal consequence.
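
To spell out the corruption risk that rules out the unlocked variant here,
a sketch of how a plain read-modify-write of page->flags could lose a
parallel atomic update to another bit in the same word:

	/*
	 * CPU0: non-atomic __ClearPageActive()   CPU1: mark_page_accessed()
	 *   old = page->flags;
	 *                                          set_bit(PG_referenced,
	 *                                                  &page->flags);
	 *   old &= ~(1UL << PG_active);
	 *   page->flags = old;    <-- silently undoes CPU1's update
	 */

Hence the patch keeps the atomic ClearPageActive() but only issues it when
the bit is actually set.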

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/swap.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3507115..4a9ac85 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -329,13 +329,15 @@ extern void add_page_to_unevictable_list(struct page *page);
  */
 static inline void lru_cache_add_anon(struct page *page)
 {
-	ClearPageActive(page);
+	if (PageActive(page))
+		ClearPageActive(page);
 	__lru_cache_add(page);
 }
 
 static inline void lru_cache_add_file(struct page *page)
 {
-	ClearPageActive(page);
+	if (PageActive(page))
+		ClearPageActive(page);
 	__lru_cache_add(page);
 }
 
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 15/16] mm: Non-atomically mark page accessed in write_begin where possible
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

aops->write_begin may allocate a new page and make it visible only to have
mark_page_accessed called almost immediately after. Once the page is
visible, atomic operations are necessary, which is noticeable overhead when
writing to an in-memory filesystem like tmpfs but should also be noticeable
with fast storage.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial allocation
of a page cache page. This patch adds an init_page_accessed() helper which
behaves like the first call to mark_page_accessed() but may be called before
the page is visible and can therefore be done non-atomically.

In this patch, new allocations in grab_cache_page_write_begin() or
find_or_create_page() use init_page_accessed() and existing pages use
mark_page_accessed().

This places a burden on filesystems because they need to ensure they either
use these helpers or update the helpers they do use to call
init_page_accessed() or mark_page_accessed() as appropriate. There is also a
snag in that the timing of the mark_page_accessed() has now changed, so in
rare cases it is possible a page gets to the end of the LRU as PageReferenced
whereas previously it might have been repromoted. This is expected to be rare
but it is worth filesystem developers thinking about it in case they see a
problem with the timing change.
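
As a rough sketch of the intended calling convention (illustrative only and
heavily simplified -- gfp handling and the retry on -EEXIST are omitted,
and the function name is made up):

	static struct page *grab_page_sketch(struct address_space *mapping,
					     pgoff_t index, gfp_t gfp)
	{
		struct page *page = find_lock_page(mapping, index);

		if (page) {
			/* Already visible: the atomic variant is required */
			mark_page_accessed(page);
			return page;
		}

		page = __page_cache_alloc(gfp);
		if (!page)
			return NULL;

		/* Not yet in the page cache, so the non-atomic init is safe */
		init_page_accessed(page);

		if (add_to_page_cache_lru(page, mapping, index, gfp)) {
			page_cache_release(page);
			return NULL;
		}
		return page;
	}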

In a profiled run measuring dd to tmpfs the overhead of mark_page_accessed was

25142     0.7055  vmlinux-3.15.0-rc1-vanilla vmlinux-3.15.0-rc1-vanilla shmem_write_end
107830    3.0256  vmlinux-3.15.0-rc1-vanilla vmlinux-3.15.0-rc1-vanilla mark_page_accessed

3.73% overall. With the patch applied, it becomes

118185    3.1712  vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 shmem_write_end
2395      0.0643  vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 init_page_accessed
159       0.0043  vmlinux-3.15.0-rc1-microopt-v1r11 vmlinux-3.15.0-rc1-microopt-v1r11 mark_page_accessed

3.23% overall. shmem_write_end increases in apparent cost because
SetPageUptodate now writes to a cache line that mark_page_accessed has not
already dirtied for it. Even with that taken into account, there are still
fewer atomic operations overall.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h |  1 +
 include/linux/swap.h       |  1 +
 mm/filemap.c               | 55 +++++++++++++++++++++++++++-------------------
 mm/shmem.c                 |  6 ++++-
 mm/swap.c                  | 11 ++++++++++
 5 files changed, 51 insertions(+), 23 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4d4b39a..2093eb7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,6 +198,7 @@ struct page;	/* forward declaration */
 TESTPAGEFLAG(Locked, locked)
 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
+	__SETPAGEFLAG(Referenced, referenced)
 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4a9ac85..e54312d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -314,6 +314,7 @@ extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
+extern void init_page_accessed(struct page *page);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index a82fbe4..c28f69c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1059,24 +1059,31 @@ struct page *find_or_create_page(struct address_space *mapping,
 	int err;
 repeat:
 	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
-		if (!page)
-			return NULL;
-		/*
-		 * We want a regular kernel memory (not highmem or DMA etc)
-		 * allocation for the radix tree nodes, but we need to honour
-		 * the context-specific requirements the caller has asked for.
-		 * GFP_RECLAIM_MASK collects those requirements.
-		 */
-		err = add_to_page_cache_lru(page, mapping, index,
-			(gfp_mask & GFP_RECLAIM_MASK));
-		if (unlikely(err)) {
-			page_cache_release(page);
-			page = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+	if (page) {
+		mark_page_accessed(page);
+		return page;
+	}
+
+	page = __page_cache_alloc(gfp_mask);
+	if (!page)
+		return NULL;
+
+	/* Init accessed to avoid an atomic mark_page_accessed later */
+	init_page_accessed(page);
+
+	/*
+	 * We want a regular kernel memory (not highmem or DMA etc)
+	 * allocation for the radix tree nodes, but we need to honour
+	 * the context-specific requirements the caller has asked for.
+	 * GFP_RECLAIM_MASK collects those requirements.
+	 */
+	err = add_to_page_cache_lru(page, mapping, index,
+		(gfp_mask & GFP_RECLAIM_MASK));
+	if (unlikely(err)) {
+		page_cache_release(page);
+		page = NULL;
+		if (err == -EEXIST)
+			goto repeat;
 	}
 	return page;
 }
@@ -2372,7 +2379,6 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
 {
 	const struct address_space_operations *aops = mapping->a_ops;
 
-	mark_page_accessed(page);
 	return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 EXPORT_SYMBOL(pagecache_write_end);
@@ -2466,12 +2472,18 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
 		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
-	if (page)
+	if (page) {
+		mark_page_accessed(page);
 		goto found;
+	}
 
 	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
 	if (!page)
 		return NULL;
+
+	/* Init accessed to avoid an atomic mark_page_accessed later */
+	init_page_accessed(page);
+
 	status = add_to_page_cache_lru(page, mapping, index,
 						GFP_KERNEL & ~gfp_notmask);
 	if (unlikely(status)) {
@@ -2530,7 +2542,7 @@ again:
 
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status))
+		if (unlikely(status < 0))
 			break;
 
 		if (mapping_writably_mapped(mapping))
@@ -2539,7 +2551,6 @@ again:
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))
diff --git a/mm/shmem.c b/mm/shmem.c
index f47fb38..700a4ad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1372,9 +1372,13 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
+	int ret;
 	struct inode *inode = mapping->host;
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	if (*pagep)
+		init_page_accessed(*pagep);
+	return ret;
 }
 
 static int
diff --git a/mm/swap.c b/mm/swap.c
index fed4caf..2490dfe 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -583,6 +583,17 @@ void mark_page_accessed(struct page *page)
 EXPORT_SYMBOL(mark_page_accessed);
 
 /*
+ * Used to mark_page_accessed(page) that is not visible yet and when it is
+ * still safe to use non-atomic ops
+ */
+void init_page_accessed(struct page *page)
+{
+	if (!PageReferenced(page))
+		__SetPageReferenced(page);
+}
+EXPORT_SYMBOL(init_page_accessed);
+
+/*
  * Queue the page for addition to the LRU via pagevec. The decision on whether
  * to add the page to the [in]active [file|anon] list is deferred until the
  * pagevec is drained. This gives a chance for the caller of __lru_cache_add()
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate
  2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
@ 2014-04-18 14:50   ` Mel Gorman
  2014-04-18 14:50   ` Mel Gorman
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-18 14:50 UTC (permalink / raw)
  To: Linux-MM; +Cc: Linux-FSDevel

The write_end handler is likely to call SetPageUptodate, which is an atomic
operation on page->flags, so prefetch that cache line for write ahead of time.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/filemap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index c28f69c..40713da 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2551,6 +2551,9 @@ again:
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		flush_dcache_page(page);
 
+		if (!PageUptodate(page))
+			prefetchw(&page->flags);
+
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH 10/16] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-04-18 14:50   ` Mel Gorman
  (?)
@ 2014-04-18 17:16   ` Vlastimil Babka
  -1 siblings, 0 replies; 47+ messages in thread
From: Vlastimil Babka @ 2014-04-18 17:16 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM; +Cc: Linux-FSDevel

On 04/18/2014 04:50 PM, Mel Gorman wrote:
> The test_bit operations in get/set pageblock flags are expensive. This patch
> reads the bitmap on a word basis and use shifts and masks to isolate the bits
> of interest. Similarly masks are used to set a local copy of the bitmap and then
> use cmpxchg to update the bitmap if there have been no other changes made in
> parallel.
> 
> In a test running dd onto tmpfs the overhead of the pageblock-related
> functions went from 1.27% in profiles to 0.5%.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/mmzone.h          |  6 +++++- 
>  include/linux/pageblock-flags.h | 21 ++++++++++++++++----
>  mm/page_alloc.c                 | 43 +++++++++++++++++++++++++----------------
>  3 files changed, 48 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c1dbe0b..c97b4bc 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -75,9 +75,13 @@ enum {
>  
>  extern int page_group_by_mobility_disabled;
>  
> +#define NR_MIGRATETYPE_BITS 3
> +#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
> +
>  static inline int get_pageblock_migratetype(struct page *page)
>  {
> -	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
> +	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
> +	return get_pageblock_flags_mask(page, NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
>  }
>  
>  struct free_area {
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 2ee8cd2..c89ac75 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -30,9 +30,12 @@ enum pageblock_bits {
>  	PB_migrate,
>  	PB_migrate_end = PB_migrate + 3 - 1,
>  			/* 3 bits required for migrate types */
> -#ifdef CONFIG_COMPACTION
>  	PB_migrate_skip,/* If set the block is skipped by compaction */
> -#endif /* CONFIG_COMPACTION */
> +
> +	/*
> +	 * Assume the bits will always align on a word. If this assumption
> +	 * changes then get/set pageblock needs updating.
> +	 */
>  	NR_PAGEBLOCK_BITS
>  };
>  
> @@ -62,9 +65,19 @@ extern int pageblock_order;
>  /* Forward declaration */
>  struct page;
>  
> +unsigned long get_pageblock_flags_mask(struct page *page,
> +				unsigned long nr_flag_bits,
> +				unsigned long mask);
> +
>  /* Declarations for getting and setting flags. See mm/page_alloc.c */
> -unsigned long get_pageblock_flags_group(struct page *page,
> -					int start_bitidx, int end_bitidx);
> +static inline unsigned long get_pageblock_flags_group(struct page *page,
> +					int start_bitidx, int end_bitidx)
> +{
> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> +	unsigned long mask = (1 << nr_flag_bits) - 1;
> +
> +	return get_pageblock_flags_mask(page, nr_flag_bits, mask);
> +}
>  void set_pageblock_flags_group(struct page *page, unsigned long flags,
>  					int start_bitidx, int end_bitidx);
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 737577c..6047866 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6012,25 +6012,24 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
>   * @end_bitidx: The last bit of interest
>   * returns pageblock_bits flags
>   */
> -unsigned long get_pageblock_flags_group(struct page *page,
> -					int start_bitidx, int end_bitidx)
> +unsigned long get_pageblock_flags_mask(struct page *page,
> +					unsigned long nr_flag_bits,
> +					unsigned long mask)

I don't think this can work with just nr_flag_bits and mask, without
taking start_bitidx into account. This probably only works when
start_bitidx == 0, which is true for PB_migrate, but not PB_migrate_skip.

>  {
>  	struct zone *zone;
>  	unsigned long *bitmap;
> -	unsigned long pfn, bitidx;
> -	unsigned long flags = 0;
> -	unsigned long value = 1;
> +	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long word;
>  
>  	zone = page_zone(page);
>  	pfn = page_to_pfn(page);
>  	bitmap = get_pageblock_bitmap(zone, pfn);
>  	bitidx = pfn_to_bitidx(zone, pfn);
> +	word_bitidx = bitidx / BITS_PER_LONG;
> +	bitidx &= (BITS_PER_LONG-1);
>  
> -	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> -		if (test_bit(bitidx + start_bitidx, bitmap))
> -			flags |= value;
> -
> -	return flags;
> +	word = bitmap[word_bitidx];
> +	return (word >> (BITS_PER_LONG - (bitidx + nr_flag_bits))) & mask;

Ugh, so for bitidx == 0, this shifts by 61 bits, so bits 61-63 are read.
Now consider this being called by get_pageblock_skip(). That will have
nr_flag_bits == 1, so it shifts by 63 -> bit 63 is read, but you probably
wanted bit 60? Or 60-62 for migratetype and 63 for the skip bit. I'm not
sure anymore which one matches the old bitmap layout and how endianness
plays a role here :) Friday evening... But changing the order of the bits,
and of the 4 bits within a word, doesn't matter I guess, as long as the
bitmap is now allocated aligned to whole words so that we don't read/write
past the end of it.
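
To make the arithmetic concrete (assuming BITS_PER_LONG == 64 and
bitidx == 0, i.e. the first pageblock in a word), the read above is
(word >> (64 - (0 + nr_flag_bits))) & mask, which gives:

  migratetype: nr_flag_bits == 3, mask == 0x7 -> shift by 61 -> bits 61-63
  skip bit:    nr_flag_bits == 1, mask == 0x1 -> shift by 63 -> bit 63

So the skip bit is read from the high bit of the migratetype field rather
than from a bit of its own, because start_bitidx (PB_migrate_skip == 3)
never enters the shift. The set_ path further down shifts by
(BITS_PER_LONG - end_bitidx - 1) and lands on the same bits, so get and set
at least agree with each other, but the two flag groups still overlap.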

>  }
>  
>  /**
> @@ -6045,20 +6044,30 @@ void set_pageblock_flags_group(struct page *page, unsigned long flags,
>  {
>  	struct zone *zone;
>  	unsigned long *bitmap;
> -	unsigned long pfn, bitidx;
> -	unsigned long value = 1;
> +	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> +	unsigned long mask = (1 << nr_flag_bits) - 1;
> +	unsigned long old_word, new_word;
> +
> +	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
>  
>  	zone = page_zone(page);
>  	pfn = page_to_pfn(page);
>  	bitmap = get_pageblock_bitmap(zone, pfn);
>  	bitidx = pfn_to_bitidx(zone, pfn);
> +	word_bitidx = bitidx / BITS_PER_LONG;
> +	bitidx &= (BITS_PER_LONG-1);
> +
>  	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
>  
> -	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> -		if (flags & value)
> -			__set_bit(bitidx + start_bitidx, bitmap);
> -		else
> -			__clear_bit(bitidx + start_bitidx, bitmap);
> +	end_bitidx = bitidx + (end_bitidx - start_bitidx);
> +	mask <<= (BITS_PER_LONG - end_bitidx - 1);
> +	flags <<= (BITS_PER_LONG - end_bitidx - 1);

Again, for bitidx == 0 and migratetype this will shift by 61, while for the
skip bit it will shift by 63 and overlap. As before, start_bitidx is not
considered except when subtracted from end_bitidx.
It would also be better if the code did not differ so much from the get_
version; the difference makes it harder to verify that they operate on the
same bits.

> +	do {
> +		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
> +		new_word = (old_word & ~mask) | flags;
> +	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);

It seems that cmpxchg is not available on SMP architectures other than x86 :(

>  }
>  
>  /*
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 01/16] mm: Disable zone_reclaim_mode by default
  2014-04-18 14:50 ` [PATCH 01/16] mm: Disable zone_reclaim_mode by default Mel Gorman
@ 2014-04-18 17:26     ` Andi Kleen
  0 siblings, 0 replies; 47+ messages in thread
From: Andi Kleen @ 2014-04-18 17:26 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

Mel Gorman <mgorman@suse.de> writes:

> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. 

Non-local memory also often destroys performance.

> Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.

While I'm not totally against this change, it will destroy many
carefully tuned configurations as the default NUMA behavior may be completely
different now. So it seems like a big hammer, and it's not even clear
exactly what problem you're solving here.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH 03/16] mm: page_alloc: Do not update zlc unless the zlc is active
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 17:52   ` Johannes Weiner
  -1 siblings, 0 replies; 47+ messages in thread
From: Johannes Weiner @ 2014-04-18 17:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 03:50:30PM +0100, Mel Gorman wrote:
> The zlc is used on NUMA machines to quickly skip over zones that are full.
> However it is always updated, even for the first zone scanned, when the
> zlc might not even be active. As it is a write to a bitmap that
> potentially bounces a cache line, it is deceptively expensive even
> though most machines will not notice. Only update the zlc if it is active.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
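
For anyone skimming the thread: as I read it, the change boils down to
guarding the cache update in get_page_from_freelist(), roughly like the
following (paraphrased, not the literal diff):

		/* zlc_active is only set once a full zone has been seen */
this_zone_full:
		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
			zlc_mark_zone_full(zonelist, z);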


* Re: [PATCH 04/16] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full"
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 17:52   ` Johannes Weiner
  -1 siblings, 0 replies; 47+ messages in thread
From: Johannes Weiner @ 2014-04-18 17:52 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 03:50:31PM +0100, Mel Gorman wrote:
> If a zone cannot be used for a dirty page then it gets marked "full"
> which is cached in the zlc and later potentially skipped by allocation
> requests that have nothing to do with dirty zones.

Urgh.  Thanks for the fix.

> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
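
Presumably the fix is simply not to cache the zone as full when it is
only being skipped for dirty throttling, i.e. something along these
lines in get_page_from_freelist() (paraphrased from the description,
not the actual diff):

		if ((alloc_flags & ALLOC_WMARK_LOW) &&
		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
			continue;	/* was: goto this_zone_full */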


* Re: [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 18:03   ` Johannes Weiner
  2014-04-19 11:18     ` Mel Gorman
  -1 siblings, 1 reply; 47+ messages in thread
From: Johannes Weiner @ 2014-04-18 18:03 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 03:50:33PM +0100, Mel Gorman wrote:
> @@ -2463,7 +2462,7 @@ static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
>  	nodemask_t *nodemask, struct zone *preferred_zone,
> -	int migratetype)
> +	int classzone_idx, int migratetype)
>  {
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
>  	struct page *page = NULL;

There is another potential update of preferred_zone in this function
after which the classzone_idx should probably be refreshed:

	/*
	 * Find the true preferred zone if the allocation is unconstrained by
	 * cpusets.
	 */
	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
		first_zones_zonelist(zonelist, high_zoneidx, NULL,
					&preferred_zone);


* Re: [PATCH 07/16] mm: page_alloc: Only check the zone id check if pages are buddies
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 18:05   ` Johannes Weiner
  -1 siblings, 0 replies; 47+ messages in thread
From: Johannes Weiner @ 2014-04-18 18:05 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 03:50:34PM +0100, Mel Gorman wrote:
> A node/zone index is used to check if pages are compatible for merging
> but this happens unconditionally even if the buddy page is not free. Defer
> the calculation as long as possible. Ideally we would check the zone boundary
> but nodes can overlap.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
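
For readers who have not looked at the diff, the idea is roughly the
following (heavily simplified sketch, guard-page handling and debug
checks omitted):

static inline int page_is_buddy(struct page *page, struct page *buddy,
								int order)
{
	if (PageBuddy(buddy) && page_order(buddy) == order) {
		/* Pay for page_zone_id() only once the buddy is known free */
		if (page_zone_id(page) != page_zone_id(buddy))
			return 0;
		return 1;
	}
	return 0;
}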


* Re: [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 18:08   ` Johannes Weiner
  2014-04-19 11:19     ` Mel Gorman
  -1 siblings, 1 reply; 47+ messages in thread
From: Johannes Weiner @ 2014-04-18 18:08 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 03:50:35PM +0100, Mel Gorman wrote:
> Currently it's calculated once per zone in the zonelist.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

I would have assumed the compiler can detect such a loop invariant...
Alas,

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
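
For reference, the manual hoisting amounts to something like this (the
variable name is invented here, not necessarily what the patch uses):

	/* Invariant across the whole zonelist walk */
	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
				   (gfp_mask & __GFP_WRITE);

	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
					nodemask) {
		/* ... */
		if (consider_zone_dirty && !zone_dirty_ok(zone))
			continue;
		/* ... */
	}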


* Re: [PATCH 09/16] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 18:10   ` Johannes Weiner
  -1 siblings, 0 replies; 47+ messages in thread
From: Johannes Weiner @ 2014-04-18 18:10 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 03:50:36PM +0100, Mel Gorman wrote:
> ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for
> __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these
> cases is a relatively rare event, but the ALLOC_NO_WATERMARK check is an
> unlikely branch in the fast path.  This patch moves the check out of the
> fast path to after it has been determined that the watermarks have not
> been met. This helps the common fast path at the cost of making the slow
> path slower and hitting kswapd with a performance cost. It's a
> reasonable tradeoff.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
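
Roughly, the check moves into the branch that is only taken once the
watermark test has already failed, something like the sketch below
(paraphrased; the flag is spelled ALLOC_NO_WATERMARKS in the code):

		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
		if (!zone_watermark_ok(zone, order, mark,
				       classzone_idx, alloc_flags)) {
			/* Rare: kswapd, __GFP_MEMALLOC, swap-over-nfs, ... */
			if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
				goto try_this_zone;
			/* ... otherwise try zone_reclaim() or skip the zone */
		}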


* Re: [PATCH 12/16] mm: shmem: Avoid atomic operation during shmem_getpage_gfp
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 18:13   ` Johannes Weiner
  -1 siblings, 0 replies; 47+ messages in thread
From: Johannes Weiner @ 2014-04-18 18:13 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 03:50:39PM +0100, Mel Gorman wrote:
> shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
> before the page is even added to the LRU or visible. This is unnecessary
> as what could it possibly race against?  Use an unlocked variant.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
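
For context, the pattern being relied on is that the flag is set while
the page is still private to the allocating task, so a plain store is
safe. Something like the following (paraphrased, and assuming the patch
adds an unlocked __SetPageSwapBacked() helper):

	page = shmem_alloc_page(gfp, info, index);
	/* ... not in the page cache and not on the LRU yet ... */
	__SetPageSwapBacked(page);	/* non-atomic, no locked bit op */
	__set_page_locked(page);
	/* only later does lru_cache_add_anon(page) publish the page */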


* Re: [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate
  2014-04-18 14:50   ` Mel Gorman
@ 2014-04-18 19:16   ` Hugh Dickins
  2014-04-19 11:23     ` Mel Gorman
  -1 siblings, 1 reply; 47+ messages in thread
From: Hugh Dickins @ 2014-04-18 19:16 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On Fri, 18 Apr 2014, Mel Gorman wrote:

> The write_end handler is likely to call SetPageUptodate which is an atomic
> operation so prefetch the line.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

This one seems a little odd to me: it feels as if you're compensating
for your mark_page_accessed() movement, but in too shmem-specific a way.

I see write_ends do SetPageUptodate more often than I was expecting
(with __block_commit_write() doing so even when PageUptodate already),
but even so...

Given that the write_end is likely to want to SetPageDirty, and sure
to want to clear_bit_unlock(PG_locked, &page->flags), wouldn't it be
better and less mysterious just to prefetchw(&page->flags) here
unconditionally?
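
That is, something like (untested):

		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		flush_dcache_page(page);

		/* write_end will dirty and unlock the page anyway */
		prefetchw(&page->flags);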

(But I'm also afraid that this sets a precedent for an avalanche of
dubious prefetchw patches all over.)

Hugh

> ---
>  mm/filemap.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c28f69c..40713da 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2551,6 +2551,9 @@ again:
>  		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
>  		flush_dcache_page(page);
>  
> +		if (!PageUptodate(page))
> +			prefetchw(&page->flags);
> +
>  		status = a_ops->write_end(file, mapping, pos, bytes, copied,
>  						page, fsdata);
>  		if (unlikely(status < 0))
> -- 
> 1.8.4.5


* Re: [PATCH 01/16] mm: Disable zone_reclaim_mode by default
  2014-04-18 17:26     ` Andi Kleen
@ 2014-04-18 21:15       ` Dave Hansen
  -1 siblings, 0 replies; 47+ messages in thread
From: Dave Hansen @ 2014-04-18 21:15 UTC (permalink / raw)
  To: Andi Kleen, Mel Gorman; +Cc: Linux-MM, Linux-FSDevel

On 04/18/2014 10:26 AM, Andi Kleen wrote:
> Mel Gorman <mgorman@suse.de> writes:
>> Favour the common case and disable it by default. Users that are
>> sophisticated enough to know they need zone_reclaim_mode will detect it.
> 
> While I'm not totally against this change, it will destroy many
> carefully tuned configurations, as the default NUMA behavior may be
> completely different now. So it seems like a big hammer, and it's not
> even clear exactly what problem you're solving here.

I'm not 100% sure what the common case _is_.  Folks who want good NUMA
affinity are happy now and are happy by default.  Folks who want to fill
memory with page cache are mad and mad by default, and they're the ones
complaining.  It's hard to count the happy ones. :)

But, on the other hand, the current situation is easy to debug.  Someone
complains that they have too much free memory, and it ends up being
pretty easy to solve just by looking at statistics, because things go
horribly wrong quickly.  If we apply this patch, it's much less obvious when
things are going wrong, and we have no statistics to help.  We'll need
to get folks running more things like numatop:

	https://01.org/numatop

That said, as a recipient of angry calls from customers who don't like
zone_reclaim_mode, I _do_ think this is the path we should take at the
moment.  Maybe we'll be reverting it in a few years once all of our
customers are angry about lack of NUMA locality.

Acked-by: Dave Hansen <dave.hansen@linux.intel.com>


* Re: [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-04-18 18:03   ` Johannes Weiner
@ 2014-04-19 11:18     ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-19 11:18 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 02:03:09PM -0400, Johannes Weiner wrote:
> On Fri, Apr 18, 2014 at 03:50:33PM +0100, Mel Gorman wrote:
> > @@ -2463,7 +2462,7 @@ static inline struct page *
> >  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	struct zonelist *zonelist, enum zone_type high_zoneidx,
> >  	nodemask_t *nodemask, struct zone *preferred_zone,
> > -	int migratetype)
> > +	int classzone_idx, int migratetype)
> >  {
> >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> >  	struct page *page = NULL;
> 
> There is another potential update of preferred_zone in this function
> after which the classzone_idx should probably be refreshed:
> 
> 	/*
> 	 * Find the true preferred zone if the allocation is unconstrained by
> 	 * cpusets.
> 	 */
> 	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
> 		first_zones_zonelist(zonelist, high_zoneidx, NULL,
> 					&preferred_zone);

Thanks, I'll fix it up for v2.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once
  2014-04-18 18:08   ` Johannes Weiner
@ 2014-04-19 11:19     ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-19 11:19 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 02:08:36PM -0400, Johannes Weiner wrote:
> On Fri, Apr 18, 2014 at 03:50:35PM +0100, Mel Gorman wrote:
> > Currently it's calculated once per zone in the zonelist.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> I would have assumed the compiler can detect such a loop invariant...
> Alas,
> 

Surprisingly, it didn't in my case, but the benefit of the patch is
marginal at best. I can drop it if it makes the code more obscure to
people's eyes.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate
  2014-04-18 19:16   ` Hugh Dickins
@ 2014-04-19 11:23     ` Mel Gorman
  0 siblings, 0 replies; 47+ messages in thread
From: Mel Gorman @ 2014-04-19 11:23 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux-MM, Linux-FSDevel

On Fri, Apr 18, 2014 at 12:16:23PM -0700, Hugh Dickins wrote:
> On Fri, 18 Apr 2014, Mel Gorman wrote:
> 
> > The write_end handler is likely to call SetPageUptodate which is an atomic
> > operation so prefetch the line.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> This one seems a little odd to me: it feels as if you're compensating
> for your mark_page_accessed() movement,

Not as such. We take the penalty anyway; it's just a question of when. As
the penalty was semi-obviously in one place, it seemed like a reasonable
thing to do.

> but in too shmem-specific a way.
> 
> I see write_ends do SetPageUptodate more often than I was expecting
> (with __block_commit_write() doing so even when PageUptodate already),
> but even so...
> 

Good point. I'll search for those and clean them up.
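For __block_commit_write() that would presumably be a trivial guard
along these lines (untested):

	/* Avoid the atomic SetPageUptodate when the page is already uptodate */
	if (!partial && !PageUptodate(page))
		SetPageUptodate(page);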

> Given that the write_end is likely to want to SetPageDirty, and sure
> to want to clear_bit_unlock(PG_locked, &page->flags), wouldn't it be
> better and less mysterious just to prefetchw(&page->flags) here
> unconditionally?
> 

Again, good point. I'm travelling at the moment but will audit the write_end
handlers when I get back and see if filesystems generally benefit or if
I was aiming at shmem too much.

> (But I'm also afraid that this sets a precedent for an avalanche of
> dubious prefetchw patches all over.)
> 

I'll include figures next time to show whether it's justified. However,
even in that case I recognise that not all CPUs treat prefetchw the same
way, and we might still want to drop this patch regardless of what I see
on one test machine.

Thanks Hugh

-- 
Mel Gorman
SUSE Labs



Thread overview: 47+ messages
2014-04-18 14:50 [PATCH 00/16] Misc page alloc, shmem and mark_page_accessed optimisations Mel Gorman
2014-04-18 14:50 ` [PATCH 01/16] mm: Disable zone_reclaim_mode by default Mel Gorman
2014-04-18 17:26   ` Andi Kleen
2014-04-18 17:26     ` Andi Kleen
2014-04-18 21:15     ` Dave Hansen
2014-04-18 21:15       ` Dave Hansen
2014-04-18 14:50 ` [PATCH 02/16] mm: page_alloc: Do not cache reclaim distances Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 14:50 ` [PATCH 03/16] mm: page_alloc: Do not update zlc unless the zlc is active Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 17:52   ` Johannes Weiner
2014-04-18 14:50 ` [PATCH 04/16] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full" Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 17:52   ` Johannes Weiner
2014-04-18 14:50 ` [PATCH 05/16] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets Mel Gorman
2014-04-18 14:50 ` [PATCH 06/16] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 18:03   ` Johannes Weiner
2014-04-19 11:18     ` Mel Gorman
2014-04-18 14:50 ` [PATCH 07/16] mm: page_alloc: Only check the zone id check if pages are buddies Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 18:05   ` Johannes Weiner
2014-04-18 14:50 ` [PATCH 08/16] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 18:08   ` Johannes Weiner
2014-04-19 11:19     ` Mel Gorman
2014-04-18 14:50 ` [PATCH 09/16] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 18:10   ` Johannes Weiner
2014-04-18 14:50 ` [PATCH 10/16] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 17:16   ` Vlastimil Babka
2014-04-18 14:50 ` [PATCH 11/16] mm: page_alloc: Reduce number of times page_to_pfn is called Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 14:50 ` [PATCH 12/16] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 18:13   ` Johannes Weiner
2014-04-18 14:50 ` [PATCH 13/16] mm: Do not use atomic operations when releasing pages Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 14:50 ` [PATCH 14/16] mm: Do not use unnecessary atomic operations when adding pages to the LRU Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 14:50 ` [PATCH 15/16] mm: Non-atomically mark page accessed in write_begin where possible Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 14:50 ` [PATCH 16/16] mm: filemap: Prefetch page->flags if !PageUptodate Mel Gorman
2014-04-18 14:50   ` Mel Gorman
2014-04-18 19:16   ` Hugh Dickins
2014-04-19 11:23     ` Mel Gorman
