* [PATCH 00/17] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations
@ 2014-05-01  8:44 ` Mel Gorman
  0 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

I was investigating a performance bug that looked like dd to tmpfs
had regressed.  The bulk of the problem turned out to be a difference
in Kconfig but it got me looking at the unnecessary overhead in tmpfs,
mark_page_accessed and parts of the allocator. This series is the result.

The patches themselves have details of the performance results but here
are some highlights based on tmpfs and ext4. The first two tables are the
result of dd'ing to a file multiple times on tmpfs.

loopdd Throughput
                     3.15.0-rc3            3.15.0-rc3
                        vanilla        lockpage-v2r33
Min         4096.0000 (  0.00%)   3891.2000 ( -5.00%)
Mean        4840.1067 (  0.00%)   5154.1333 (  6.49%)
TrimMean    4867.6571 (  0.00%)   5204.1143 (  6.91%)
Stddev       160.6807 (  0.00%)    275.1917 ( 71.27%)
Max         5017.6000 (  0.00%)   5324.8000 (  6.12%)

loopdd elapsed time
                            3.15.0-rc3            3.15.0-rc3
                               vanilla        lockpage-v2r33
Min      elapsed      0.4100 (  0.00%)      0.3900 (  4.88%)
Mean     elapsed      0.4780 (  0.00%)      0.4203 ( 12.06%)
TrimMean elapsed      0.4796 (  0.00%)      0.4179 ( 12.88%)
Stddev   elapsed      0.0353 (  0.00%)      0.0379 ( -7.23%)
Max      elapsed      0.5100 (  0.00%)      0.4800 (  5.88%)

This table shows the latency in usecs of accessing ext4-backed
mappings of various sizes

lat_mmap
                       3.15.0-rc3            3.15.0-rc3
                          vanilla        lockpage-v2r33
Procs 107M     557.0000 (  0.00%)    544.0000 (  2.33%)
Procs 214M    1150.0000 (  0.00%)   1058.0000 (  8.00%)
Procs 322M    1897.0000 (  0.00%)   1554.0000 ( 18.08%)
Procs 429M    2188.0000 (  0.00%)   2652.0000 (-21.21%)
Procs 536M    2622.0000 (  0.00%)   2473.0000 (  5.68%)
Procs 644M    3065.0000 (  0.00%)   2486.0000 ( 18.89%)
Procs 751M    3400.0000 (  0.00%)   3012.0000 ( 11.41%)
Procs 859M    3996.0000 (  0.00%)   3926.0000 (  1.75%)
Procs 966M    4646.0000 (  0.00%)   3763.0000 ( 19.01%)
Procs 1073M   4981.0000 (  0.00%)   4154.0000 ( 16.60%)
Procs 1181M   5419.0000 (  0.00%)   5152.0000 (  4.93%)
Procs 1288M   5553.0000 (  0.00%)   5538.0000 (  0.27%)
Procs 1395M   5841.0000 (  0.00%)   5730.0000 (  1.90%)
Procs 1503M   6225.0000 (  0.00%)   5981.0000 (  3.92%)
Procs 1610M   6558.0000 (  0.00%)   6332.0000 (  3.45%)
Procs 1717M   7130.0000 (  0.00%)   6741.0000 (  5.46%)
Procs 1825M   9394.0000 (  0.00%)   8483.0000 (  9.70%)
Procs 1932M   8056.0000 (  0.00%)   9427.0000 (-17.02%)
Procs 2040M   8463.0000 (  0.00%)   9030.0000 ( -6.70%)
Procs 2147M   9014.0000 (  0.00%)   8608.0000 (  4.50%)

In general the system CPU overhead is lower.

 arch/tile/mm/homecache.c        |   2 +-
 fs/btrfs/extent_io.c            |  11 +-
 fs/btrfs/file.c                 |   5 +-
 fs/buffer.c                     |   7 +-
 fs/ext4/mballoc.c               |  14 +-
 fs/f2fs/checkpoint.c            |   3 -
 fs/f2fs/node.c                  |   2 -
 fs/fuse/dev.c                   |   2 +-
 fs/fuse/file.c                  |   2 -
 fs/gfs2/aops.c                  |   1 -
 fs/gfs2/meta_io.c               |   4 +-
 fs/ntfs/attrib.c                |   1 -
 fs/ntfs/file.c                  |   1 -
 include/linux/cpuset.h          |  31 +++++
 include/linux/gfp.h             |   4 +-
 include/linux/mmzone.h          |  22 ++-
 include/linux/page-flags.h      |  18 +++
 include/linux/pageblock-flags.h |  34 ++++-
 include/linux/pagemap.h         | 115 ++++++++++++++--
 include/linux/swap.h            |   9 +-
 kernel/cpuset.c                 |   8 +-
 kernel/sched/wait.c             |   3 +-
 mm/filemap.c                    | 292 ++++++++++++++++++++++------------------
 mm/page_alloc.c                 | 226 ++++++++++++++++++-------------
 mm/shmem.c                      |   8 +-
 mm/swap.c                       |  17 ++-
 mm/swap_state.c                 |   2 +-
 mm/vmscan.c                     |   6 +-
 28 files changed, 556 insertions(+), 294 deletions(-)

-- 
1.8.4.5


* [PATCH 01/17] mm: page_alloc: Do not update zlc unless the zlc is active
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

The zlc is used on NUMA machines to quickly skip over zones that are full.
However, it is always updated, even for the first zone scanned when the
zlc might not even be active. As it is a write to a bitmap that potentially
bounces a cache line, it is deceptively expensive and most machines will
not care. Only update the zlc if it is active.
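
As a rough sketch of why the guard is safe (condensed from the surrounding
get_page_from_freelist() code, not compilable as-is): zlc_active only becomes
non-zero once a watermark check has failed and zlc_setup() has run, so any
write to the zonelist_cache bitmap before that point is wasted work.

	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
		    !zlc_zone_worth_trying(zonelist, z, allowednodes))
			continue;
		...
		if (!zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags)) {
			if (IS_ENABLED(CONFIG_NUMA) && !did_zlc_setup &&
			    nr_online_nodes > 1) {
				/* first failure: the cache becomes active here */
				allowednodes = zlc_setup(zonelist, alloc_flags);
				zlc_active = 1;
				did_zlc_setup = 1;
			}
			...
		}
		...
this_zone_full:
		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)	/* this patch */
			zlc_mark_zone_full(zonelist, z);
	}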

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba293..f8b80c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2044,7 +2044,7 @@ try_this_zone:
 		if (page)
 			break;
 this_zone_full:
-		if (IS_ENABLED(CONFIG_NUMA))
+		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-- 
1.8.4.5


* [PATCH 02/17] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full"
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

If a zone cannot be used for a dirty page then it gets marked "full",
which is cached in the zlc and later causes the zone to be skipped by
allocation requests that have nothing to do with the dirty limits.
Skip the zone for this allocation instead of marking it full in the zlc.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8b80c3..5c559e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1976,7 +1976,7 @@ zonelist_scan:
 		 */
 		if ((alloc_flags & ALLOC_WMARK_LOW) &&
 		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
-			goto this_zone_full;
+			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
-- 
1.8.4.5


* [PATCH 03/17] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

If cpusets are not in use then we still check a global variable on every
page allocation. Use jump labels to avoid the overhead.
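
For readers unfamiliar with jump labels, this is the general shape of the
pattern the patch follows (the names here are illustrative; the real key and
helpers are in the diff below). With HAVE_JUMP_LABEL, static_key_false()
compiles to a branch that is patched to a no-op while the key is zero, so the
common no-cpusets case pays almost nothing:

	static struct static_key my_feature_key = STATIC_KEY_INIT_FALSE;

	static inline bool my_feature_enabled(void)
	{
		/* patched-out branch until static_key_slow_inc() is called */
		return static_key_false(&my_feature_key);
	}

	static inline void my_feature_register(void)
	{
		static_key_slow_inc(&my_feature_key);
	}

	static inline void my_feature_unregister(void)
	{
		static_key_slow_dec(&my_feature_key);
	}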

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/cpuset.h | 31 +++++++++++++++++++++++++++++++
 kernel/cpuset.c        |  8 ++++++--
 mm/page_alloc.c        |  3 ++-
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index b19d3dc..2b89e07 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -17,6 +17,35 @@
 
 extern int number_of_cpusets;	/* How many cpusets are defined in system? */
 
+#ifdef HAVE_JUMP_LABEL
+extern struct static_key cpusets_enabled_key;
+static inline bool cpusets_enabled(void)
+{
+	return static_key_false(&cpusets_enabled_key);
+}
+#else
+static inline bool cpusets_enabled(void)
+{
+	return number_of_cpusets > 1;
+}
+#endif
+
+static inline void cpuset_inc(void)
+{
+	number_of_cpusets++;
+#ifdef HAVE_JUMP_LABEL
+	static_key_slow_inc(&cpusets_enabled_key);
+#endif
+}
+
+static inline void cpuset_dec(void)
+{
+	number_of_cpusets--;
+#ifdef HAVE_JUMP_LABEL
+	static_key_slow_dec(&cpusets_enabled_key);
+#endif
+}
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_update_active_cpus(bool cpu_online);
@@ -124,6 +153,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 #else /* !CONFIG_CPUSETS */
 
+static inline bool cpusets_enabled(void) { return false; }
+
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3d54c41..34ada52 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -68,6 +68,10 @@
  */
 int number_of_cpusets __read_mostly;
 
+#ifdef HAVE_JUMP_LABEL
+struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;
+#endif
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -1888,7 +1892,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
 
-	number_of_cpusets++;
+	cpuset_inc();
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -1939,7 +1943,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
-	number_of_cpusets--;
+	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
 	mutex_unlock(&cpuset_mutex);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c559e3..cb12b9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1930,7 +1930,8 @@ zonelist_scan:
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
+		if (cpusets_enabled() &&
+			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
-- 
1.8.4.5


* [PATCH 04/17] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

There is no need to calculate zone_idx(preferred_zone) multiple times
or use the pgdat to figure it out.
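
For reference, the two helpers involved are roughly the following (as defined
in include/linux/mmzone.h of this era); reusing the index cached in the
zoneref avoids redoing the pointer arithmetic through the pgdat on every
allocation attempt:

	/* zone_idx() walks back through the pgdat ... */
	#define zone_idx(zone)	((zone) - (zone)->zone_pgdat->node_zones)

	/* ... while the zoneref already carries the answer */
	static inline int zonelist_zone_idx(struct zoneref *zoneref)
	{
		return zoneref->zone_idx;
	}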

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 55 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb12b9a..3b6ae9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1907,17 +1907,15 @@ static inline void init_zone_allows_reclaim(int nid)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int classzone_idx, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
-	int classzone_idx;
 	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	classzone_idx = zone_idx(preferred_zone);
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2174,7 +2172,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
@@ -2192,7 +2190,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto out;
 
@@ -2227,7 +2225,7 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int classzone_idx, int migratetype, bool sync_migration,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2255,7 +2253,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			compaction_defer_reset(preferred_zone, order, true);
@@ -2287,7 +2285,7 @@ static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int classzone_idx, int migratetype, bool sync_migration,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2328,7 +2326,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress)
+	int classzone_idx, int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	bool drained = false;
@@ -2346,7 +2344,8 @@ retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
-					preferred_zone, migratetype);
+					preferred_zone, classzone_idx,
+					migratetype);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2369,14 +2368,14 @@ static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2477,7 +2476,7 @@ static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -2526,15 +2525,19 @@ restart:
 	 * Find the true preferred zone if the allocation is unconstrained by
 	 * cpusets.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
-		first_zones_zonelist(zonelist, high_zoneidx, NULL,
-					&preferred_zone);
+	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask) {
+		struct zoneref *preferred_zoneref;
+		preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
+				nodemask ? : &cpuset_current_mems_allowed,
+				&preferred_zone);
+		classzone_idx = zonelist_zone_idx(preferred_zoneref);
+	}
 
 rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto got_pg;
 
@@ -2549,7 +2552,7 @@ rebalance:
 
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			goto got_pg;
 		}
@@ -2582,6 +2585,7 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
+					classzone_idx,
 					migratetype, sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
@@ -2605,7 +2609,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress);
+					classzone_idx, migratetype,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -2624,7 +2629,7 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					classzone_idx, migratetype);
 			if (page)
 				goto got_pg;
 
@@ -2667,6 +2672,7 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
+					classzone_idx,
 					migratetype, sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
@@ -2694,11 +2700,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
+	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	struct mem_cgroup *memcg = NULL;
+	int classzone_idx;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2728,11 +2736,12 @@ retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx,
+	preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
 				nodemask ? : &cpuset_current_mems_allowed,
 				&preferred_zone);
 	if (!preferred_zone)
 		goto out;
+	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -2742,7 +2751,7 @@ retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
 		 * The first pass makes sure allocations are spread
@@ -2768,7 +2777,7 @@ retry:
 		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
-- 
1.8.4.5


* [PATCH 05/17] mm: page_alloc: Only check the zone id if pages are buddies
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

A node/zone index is used to check whether pages are compatible for
merging, but this happens unconditionally even if the buddy page is not
free. Defer the calculation until it is known that the buddy page is free.
Ideally we would check the zone boundary directly, but nodes can overlap.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b6ae9d..8971953 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -508,16 +508,26 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	if (!pfn_valid_within(page_to_pfn(buddy)))
 		return 0;
 
-	if (page_zone_id(page) != page_zone_id(buddy))
-		return 0;
-
 	if (page_is_guard(buddy) && page_order(buddy) == order) {
 		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);
+
+		if (page_zone_id(page) != page_zone_id(buddy))
+			return 0;
+
 		return 1;
 	}
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
 		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);
+
+		/*
+		 * zone check is done late to avoid uselessly
+		 * calculating zone/node ids for pages that could
+		 * never merge.
+		 */
+		if (page_zone_id(page) != page_zone_id(buddy))
+			return 0;
+
 		return 1;
 	}
 	return 0;
-- 
1.8.4.5


* [PATCH 06/17] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

Currently the dirty-zone test (ALLOC_WMARK_LOW combined with __GFP_WRITE)
is evaluated once per zone in the zonelist. Evaluate it once per allocation
attempt instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8971953..2e576fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1925,6 +1925,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
+				(gfp_mask & __GFP_WRITE);
 
 zonelist_scan:
 	/*
@@ -1983,8 +1985,7 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+		if (consider_zone_dirty && !zone_dirty_ok(zone))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
-- 
1.8.4.5


* [PATCH 07/17] mm: page_alloc: Take the ALLOC_NO_WATERMARKS check out of the fast path
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

ALLOC_NO_WATERMARKS is set in a few cases: always by kswapd, always for
__GFP_MEMALLOC, and sometimes for swap-over-nfs, certain tasks and so on.
Each of these cases is relatively rare, but the ALLOC_NO_WATERMARKS check
is an unlikely branch in the fast path. This patch moves the check out of
the fast path so it is only made after it has been determined that the
watermarks have not been met. This helps the common fast path at the cost
of making the slow path slower and hitting kswapd with a performance cost.
It's a reasonable tradeoff.
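
For context, the reason the BUILD_BUG_ON travels with the check: the low bits
of alloc_flags select which watermark to test, and ALLOC_NO_WATERMARKS must
lie outside that range. The flag layout is roughly the following (from
mm/internal.h of this era, quoted from memory):

	#define ALLOC_WMARK_MIN		WMARK_MIN
	#define ALLOC_WMARK_LOW		WMARK_LOW
	#define ALLOC_WMARK_HIGH	WMARK_HIGH
	#define ALLOC_NO_WATERMARKS	0x04	/* don't check watermarks at all */

	/* Mask to get the watermark bits */
	#define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)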

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e576fd..dc123ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1944,9 +1944,6 @@ zonelist_scan:
 			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
-		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
-		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
-			goto try_this_zone;
 		/*
 		 * Distribute pages in proportion to the individual
 		 * zone size to ensure fair page aging.  The zone a
@@ -1993,6 +1990,11 @@ zonelist_scan:
 				       classzone_idx, alloc_flags)) {
 			int ret;
 
+			/* Checked here to keep the fast path fast */
+			BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
+			if (alloc_flags & ALLOC_NO_WATERMARKS)
+				goto try_this_zone;
+
 			if (IS_ENABLED(CONFIG_NUMA) &&
 					!did_zlc_setup && nr_online_nodes > 1) {
 				/*
-- 
1.8.4.5


* [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

The test_bit operations in the get/set pageblock flags helpers are
expensive. This patch reads the bitmap on a word basis and uses shifts
and masks to isolate the bits of interest. Similarly, masks are used to
build a local copy of the word, and cmpxchg then updates the bitmap only
if no other changes were made in parallel.

In a test running dd onto tmpfs, the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.
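
The following is a userspace-flavoured illustration of the technique only;
the bit-index arithmetic and field layout are simplified and do not match the
kernel's actual pageblock bitmap, and the kernel uses ACCESS_ONCE() and
cmpxchg() rather than the GCC builtin used here:

	#include <limits.h>

	#define NR_FIELD_BITS	3
	#define FIELD_MASK	((1UL << NR_FIELD_BITS) - 1)
	#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

	/* one word load plus shift/mask instead of one test_bit() per bit */
	static unsigned long get_field(const unsigned long *bitmap,
				       unsigned long bitidx)
	{
		unsigned long word = bitmap[bitidx / BITS_PER_LONG];

		return (word >> (bitidx % BITS_PER_LONG)) & FIELD_MASK;
	}

	/* build the new word locally, then retry the swap if another
	 * thread changed any bit of the word in the meantime */
	static void set_field(unsigned long *bitmap, unsigned long bitidx,
			      unsigned long flags)
	{
		unsigned long *word = &bitmap[bitidx / BITS_PER_LONG];
		unsigned long shift = bitidx % BITS_PER_LONG;
		unsigned long mask = FIELD_MASK << shift;
		unsigned long old, new;

		do {
			old = *word;
			new = (old & ~mask) | ((flags << shift) & mask);
		} while (__sync_val_compare_and_swap(word, old, new) != old);
	}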

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h          |  7 +++++-
 include/linux/pageblock-flags.h | 39 +++++++++++++++++++++++++++-----
 mm/page_alloc.c                 | 49 +++++++++++++++++++++++++----------------
 3 files changed, 69 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fac5509..c84703d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -75,9 +75,14 @@ enum {
 
 extern int page_group_by_mobility_disabled;
 
+#define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
+#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
+
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
+	return get_pageblock_flags_mask(page, PB_migrate_end,
+					NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
 }
 
 struct free_area {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 2ee8cd2..bc37036 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,9 +30,12 @@ enum pageblock_bits {
 	PB_migrate,
 	PB_migrate_end = PB_migrate + 3 - 1,
 			/* 3 bits required for migrate types */
-#ifdef CONFIG_COMPACTION
 	PB_migrate_skip,/* If set the block is skipped by compaction */
-#endif /* CONFIG_COMPACTION */
+
+	/*
+	 * Assume the bits will always align on a word. If this assumption
+	 * changes then get/set pageblock needs updating.
+	 */
 	NR_PAGEBLOCK_BITS
 };
 
@@ -62,11 +65,35 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
+unsigned long get_pageblock_flags_mask(struct page *page,
+				unsigned long end_bitidx,
+				unsigned long nr_flag_bits,
+				unsigned long mask);
+void set_pageblock_flags_mask(struct page *page,
+				unsigned long flags,
+				unsigned long end_bitidx,
+				unsigned long nr_flag_bits,
+				unsigned long mask);
+
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx);
-void set_pageblock_flags_group(struct page *page, unsigned long flags,
-					int start_bitidx, int end_bitidx);
+static inline unsigned long get_pageblock_flags_group(struct page *page,
+					int start_bitidx, int end_bitidx)
+{
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+
+	return get_pageblock_flags_mask(page, end_bitidx, nr_flag_bits, mask);
+}
+
+static inline void set_pageblock_flags_group(struct page *page,
+					unsigned long flags,
+					int start_bitidx, int end_bitidx)
+{
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+
+	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
+}
 
 #ifdef CONFIG_COMPACTION
 #define get_pageblock_skip(page) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc123ff..2cf1558 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6032,25 +6032,26 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
+unsigned long get_pageblock_flags_mask(struct page *page,
+					unsigned long end_bitidx,
+					unsigned long nr_flag_bits,
+					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long flags = 0;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long word;
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (test_bit(bitidx + start_bitidx, bitmap))
-			flags |= value;
-
-	return flags;
+	word = bitmap[word_bitidx];
+	bitidx += end_bitidx;
+	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
 }
 
 /**
@@ -6060,25 +6061,35 @@ unsigned long get_pageblock_flags_group(struct page *page,
  * @end_bitidx: The last bit of interest
  * @flags: The flags to set
  */
-void set_pageblock_flags_group(struct page *page, unsigned long flags,
-					int start_bitidx, int end_bitidx)
+void set_pfnblock_flags_group(struct page *page, unsigned long flags,
+					unsigned long end_bitidx,
+					unsigned long nr_flag_bits,
+					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long old_word, new_word;
+
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
+
 	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (flags & value)
-			__set_bit(bitidx + start_bitidx, bitmap);
-		else
-			__clear_bit(bitidx + start_bitidx, bitmap);
+	bitidx += end_bitidx;
+	mask <<= (BITS_PER_LONG - bitidx - 1);
+	flags <<= (BITS_PER_LONG - bitidx - 1);
+
+	do {
+		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
+		new_word = (old_word & ~mask) | flags;
+	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);
 }
 
 /*
-- 
1.8.4.5
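
For readers who want to play with the access pattern outside the kernel, here
is a minimal userspace sketch of the same word-read plus compare-and-swap
update used by the get/set helpers above. The names and the __sync builtin are
stand-ins for the kernel's helpers, and like the kernel code it assumes the
flag group never crosses a word boundary.

#include <limits.h>

#define BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Read the group of flags ending at end_bitidx within a single word. */
static unsigned long word_read_flags(const unsigned long *bitmap,
		unsigned long bitidx, unsigned long end_bitidx,
		unsigned long mask)
{
	unsigned long word = bitmap[bitidx / BITS_PER_LONG];

	bitidx = (bitidx & (BITS_PER_LONG - 1)) + end_bitidx;
	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
}

/* Update the same flags with a compare-and-swap retry loop. */
static void word_set_flags(unsigned long *bitmap, unsigned long bitidx,
		unsigned long end_bitidx, unsigned long mask,
		unsigned long flags)
{
	unsigned long *word = &bitmap[bitidx / BITS_PER_LONG];
	unsigned long old_word, new_word;

	bitidx = (bitidx & (BITS_PER_LONG - 1)) + end_bitidx;
	mask <<= BITS_PER_LONG - bitidx - 1;
	flags <<= BITS_PER_LONG - bitidx - 1;

	do {
		old_word = *(volatile unsigned long *)word;
		new_word = (old_word & ~mask) | flags;
	} while (!__sync_bool_compare_and_swap(word, old_word, new_word));
}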


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 09/17] mm: page_alloc: Reduce number of times page_to_pfn is called
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

In the free path we calculate page_to_pfn multiple times. Reduce that by
calculating the pfn once at the top of the call chain and passing it down.
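
As a rough sketch of the resulting pattern (the accessors named below are the
ones introduced by this series, but the helper itself is hypothetical and
condensed, not a verbatim copy of the final code):

/* Hypothetical condensed free path: the pfn is computed once and threaded
 * through rather than each helper calling page_to_pfn() again. */
static void free_page_sketch(struct page *page, unsigned int order)
{
	unsigned long pfn = page_to_pfn(page);	/* the only conversion */
	int migratetype = get_pfnblock_migratetype(page, pfn);

	free_one_page(page_zone(page), page, pfn, order, migratetype);
}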

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h          |  9 +++++++--
 include/linux/pageblock-flags.h | 35 +++++++++++++++--------------------
 mm/page_alloc.c                 | 32 ++++++++++++++++++--------------
 3 files changed, 40 insertions(+), 36 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c84703d..2c3037a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -78,10 +78,15 @@ extern int page_group_by_mobility_disabled;
 #define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
 #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
 
-static inline int get_pageblock_migratetype(struct page *page)
+#define get_pageblock_migratetype(page)					\
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+			PB_migrate_end, NR_MIGRATETYPE_BITS,		\
+			MIGRATETYPE_MASK)
+
+static inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn)
 {
 	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
-	return get_pageblock_flags_mask(page, PB_migrate_end,
+	return get_pfnblock_flags_mask(page, pfn, PB_migrate_end,
 					NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
 }
 
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index bc37036..2900b42 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -65,35 +65,30 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page,
+				unsigned long pfn,
 				unsigned long end_bitidx,
 				unsigned long nr_flag_bits,
 				unsigned long mask);
-void set_pageblock_flags_mask(struct page *page,
+
+void set_pfnblock_flags_mask(struct page *page,
 				unsigned long flags,
+				unsigned long pfn,
 				unsigned long end_bitidx,
 				unsigned long nr_flag_bits,
 				unsigned long mask);
 
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-static inline unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
-{
-	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
-	unsigned long mask = (1 << nr_flag_bits) - 1;
-
-	return get_pageblock_flags_mask(page, end_bitidx, nr_flag_bits, mask);
-}
-
-static inline void set_pageblock_flags_group(struct page *page,
-					unsigned long flags,
-					int start_bitidx, int end_bitidx)
-{
-	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
-	unsigned long mask = (1 << nr_flag_bits) - 1;
-
-	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
-}
+#define get_pageblock_flags_group(page, start_bitidx, end_bitidx) \
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+			end_bitidx,					\
+			end_bitidx - start_bitidx + 1,			\
+			(1 << (end_bitidx - start_bitidx + 1)) - 1)
+#define set_pageblock_flags_group(page, flags, start_bitidx, end_bitidx) \
+	set_pfnblock_flags_mask(page, flags, page_to_pfn(page),		\
+			end_bitidx,					\
+			end_bitidx - start_bitidx + 1,			\
+			(1 << (end_bitidx - start_bitidx + 1)) - 1)
 
 #ifdef CONFIG_COMPACTION
 #define get_pageblock_skip(page) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2cf1558..61d45fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -559,6 +559,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
 		struct zone *zone, unsigned int order,
 		int migratetype)
 {
@@ -575,7 +576,7 @@ static inline void __free_one_page(struct page *page,
 
 	VM_BUG_ON(migratetype == -1);
 
-	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+	page_idx = pfn & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -710,7 +711,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			list_del(&page->lru);
 			mt = get_freepage_migratetype(page);
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
-			__free_one_page(page, zone, 0, mt);
+			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
 			if (likely(!is_migrate_isolate_page(page))) {
 				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
@@ -722,13 +723,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order,
+static void free_one_page(struct zone *zone,
+				struct page *page, unsigned long pfn,
+				int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone->pages_scanned = 0;
 
-	__free_one_page(page, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype);
 	if (unlikely(!is_migrate_isolate(migratetype)))
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock(&zone->lock);
@@ -765,15 +768,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	unsigned long pfn = page_to_pfn(page);
 
 	if (!free_pages_prepare(page, order))
 		return;
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
-	free_one_page(page_zone(page), page, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
 }
 
@@ -1376,12 +1380,13 @@ void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
 	if (!free_pages_prepare(page, 0))
 		return;
 
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
@@ -1395,7 +1400,7 @@ void free_hot_cold_page(struct page *page, int cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
@@ -6032,18 +6037,17 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
 					unsigned long end_bitidx,
 					unsigned long nr_flag_bits,
 					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long bitidx, word_bitidx;
 	unsigned long word;
 
 	zone = page_zone(page);
-	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
@@ -6061,20 +6065,20 @@ unsigned long get_pageblock_flags_mask(struct page *page,
  * @end_bitidx: The last bit of interest
  * @flags: The flags to set
  */
-void set_pageblock_flags_mask(struct page *page, unsigned long flags,
+void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
+					unsigned long pfn,
 					unsigned long end_bitidx,
 					unsigned long nr_flag_bits,
 					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long bitidx, word_bitidx;
 	unsigned long old_word, new_word;
 
 	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 
 	zone = page_zone(page);
-	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 10/17] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

get_pageblock_migratetype() is called during free with IRQs disabled. This
is unnecessary and widens the IRQ-disabled window, so do the lookup before
IRQs are disabled.
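
For clarity, the free path ends up looking roughly like this (reconstructed
from the hunk below together with the previous patch, not copied from the
final tree):

static void __free_pages_ok(struct page *page, unsigned int order)
{
	unsigned long flags;
	int migratetype;
	unsigned long pfn = page_to_pfn(page);

	if (!free_pages_prepare(page, order))
		return;

	/* The lookup now happens while IRQs are still enabled */
	migratetype = get_pfnblock_migratetype(page, pfn);
	local_irq_save(flags);
	__count_vm_events(PGFREE, 1 << order);
	set_freepage_migratetype(page, migratetype);
	free_one_page(page_zone(page), page, pfn, order, migratetype);
	local_irq_restore(flags);
}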

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 61d45fd..2e55bc8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -773,9 +773,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (!free_pages_prepare(page, order))
 		return;
 
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 11/17] mm: page_alloc: Use unsigned int for order in more places
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

x86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type is used for page
order. This converts a number of sites in mm/page_alloc.c to use
unsigned int for order where possible.
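
One hunk below also adds an upper bound to the descending loop in
__rmqueue_fallback. That guard matters because once current_order is
unsigned, "current_order >= order" can never become false when order is 0;
the decrement past zero wraps around instead. A standalone illustration of
the pitfall (hypothetical example, not kernel code):

#include <stdio.h>

#define MAX_ORDER 11

int main(void)
{
	unsigned int order = 0, current_order;
	int iterations = 0;

	/*
	 * When both variables were signed ints, the ">= order" test alone
	 * ended the loop once current_order reached -1. With unsigned
	 * types the decrement wraps from 0 to UINT_MAX, so the extra
	 * "<= MAX_ORDER - 1" bound is what catches the wrap-around.
	 */
	for (current_order = MAX_ORDER - 1;
	     current_order >= order && current_order <= MAX_ORDER - 1;
	     --current_order)
		iterations++;

	printf("loop body ran %d times\n", iterations);	/* prints 11 */
	return 0;
}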

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |  8 ++++----
 mm/page_alloc.c        | 43 +++++++++++++++++++++++--------------------
 2 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2c3037a..d20403d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -818,10 +818,10 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
-bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		int classzone_idx, int alloc_flags);
-bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
-		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok(struct zone *z, unsigned int order,
+		unsigned long mark, int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
+		unsigned long mark, int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
 	MEMMAP_HOTPLUG,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e55bc8..087c178 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -408,7 +408,8 @@ static int destroy_compound_page(struct page *page, unsigned long order)
 	return bad;
 }
 
-static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
+static inline void prep_zero_page(struct page *page, unsigned int order,
+							gfp_t gfp_flags)
 {
 	int i;
 
@@ -452,7 +453,7 @@ static inline void set_page_guard_flag(struct page *page) { }
 static inline void clear_page_guard_flag(struct page *page) { }
 #endif
 
-static inline void set_page_order(struct page *page, int order)
+static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
 	__SetPageBuddy(page);
@@ -503,7 +504,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
  * For recording page's order, we use page_private(page).
  */
 static inline int page_is_buddy(struct page *page, struct page *buddy,
-								int order)
+							unsigned int order)
 {
 	if (!pfn_valid_within(page_to_pfn(buddy)))
 		return 0;
@@ -725,7 +726,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 static void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
-				int order,
+				unsigned int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
@@ -896,7 +897,7 @@ static inline int check_new_page(struct page *page)
 	return 0;
 }
 
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags)
 {
 	int i;
 
@@ -1104,16 +1105,17 @@ static int try_to_steal_freepages(struct zone *zone, struct page *page,
 
 /* Remove an element from the buddy allocator from the fallback list */
 static inline struct page *
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
+__rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 {
 	struct free_area *area;
-	int current_order;
+	unsigned int current_order;
 	struct page *page;
 	int migratetype, new_type, i;
 
 	/* Find the largest possible block of pages in the other list */
-	for (current_order = MAX_ORDER-1; current_order >= order;
-						--current_order) {
+	for (current_order = MAX_ORDER-1;
+				current_order >= order && current_order <= MAX_ORDER-1;
+				--current_order) {
 		for (i = 0;; i++) {
 			migratetype = fallbacks[start_migratetype][i];
 
@@ -1341,7 +1343,7 @@ void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
 	unsigned long flags;
-	int order, t;
+	unsigned int order, t;
 	struct list_head *curr;
 
 	if (zone_is_empty(zone))
@@ -1537,8 +1539,8 @@ int split_free_page(struct page *page)
  */
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
-			struct zone *zone, int order, gfp_t gfp_flags,
-			int migratetype)
+			struct zone *zone, unsigned int order,
+			gfp_t gfp_flags, int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1687,8 +1689,9 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags, long free_pages)
+static bool __zone_watermark_ok(struct zone *z, unsigned int order,
+			unsigned long mark, int classzone_idx, int alloc_flags,
+			long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
@@ -1722,15 +1725,15 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return true;
 }
 
-bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
 					zone_page_state(z, NR_FREE_PAGES));
 }
 
-bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
+			unsigned long mark, int classzone_idx, int alloc_flags)
 {
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
 
@@ -4123,7 +4126,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
-	int order, t;
+	unsigned int order, t;
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
@@ -6447,7 +6450,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct page *page;
 	struct zone *zone;
-	int order, i;
+	unsigned int order, i;
 	unsigned long pfn;
 	unsigned long flags;
 	/* find the first valid pfn */
@@ -6499,7 +6502,7 @@ bool is_free_buddy_page(struct page *page)
 	struct zone *zone = page_zone(page);
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long flags;
-	int order;
+	unsigned int order;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (order = 0; order < MAX_ORDER; order++) {
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 12/17] mm: page_alloc: Convert hot/cold parameter and immediate callers to bool
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

cold is used as a boolean, so make it one. Make the likely (hot) case the
"if" part of the block instead of the "else", as the optimisation manual
says this ordering is preferred.
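
Condensed from the free_hot_cold_page hunk below (the helper name here is
illustrative only), the per-cpu list insertion ends up as:

static void pcp_list_add(struct page *page, struct per_cpu_pages *pcp,
		int migratetype, bool cold)
{
	/*
	 * Hot pages are the common case, so they take the "if" arm and go
	 * to the head of the list where they are reused first; cold pages
	 * go to the tail.
	 */
	if (!cold)
		list_add(&page->lru, &pcp->lists[migratetype]);
	else
		list_add_tail(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
}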

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/tile/mm/homecache.c |  2 +-
 fs/fuse/dev.c            |  2 +-
 include/linux/gfp.h      |  4 ++--
 include/linux/pagemap.h  |  2 +-
 include/linux/swap.h     |  2 +-
 mm/page_alloc.c          | 20 ++++++++++----------
 mm/swap.c                |  4 ++--
 mm/swap_state.c          |  2 +-
 mm/vmscan.c              |  6 +++---
 9 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 004ba56..33294fd 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -417,7 +417,7 @@ void __homecache_free_pages(struct page *page, unsigned int order)
 	if (put_page_testzero(page)) {
 		homecache_change_page_home(page, order, PAGE_HOME_HASH);
 		if (order == 0) {
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		} else {
 			init_page_count(page);
 			__free_pages(page, order);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index aac71ce..098f97b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1614,7 +1614,7 @@ out_finish:
 
 static void fuse_retrieve_end(struct fuse_conn *fc, struct fuse_req *req)
 {
-	release_pages(req->pages, req->num_pages, 0);
+	release_pages(req->pages, req->num_pages, false);
 }
 
 static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 39b81dc..3824ac6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -369,8 +369,8 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
-extern void free_hot_cold_page(struct page *page, int cold);
-extern void free_hot_cold_page_list(struct list_head *list, int cold);
+extern void free_hot_cold_page(struct page *page, bool cold);
+extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
 extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 45598f1..9175f52 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -110,7 +110,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
-void release_pages(struct page **pages, int nr, int cold);
+void release_pages(struct page **pages, int nr, bool cold);
 
 /*
  * speculatively take a reference to a page.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3507115..da8a250 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -496,7 +496,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 #define free_page_and_swap_cache(page) \
 	page_cache_release(page)
 #define free_pages_and_swap_cache(pages, nr) \
-	release_pages((pages), (nr), 0);
+	release_pages((pages), (nr), false);
 
 static inline void show_swap_cache_info(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 087c178..94c5d06 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1189,7 +1189,7 @@ retry_reserve:
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype, int cold)
+			int migratetype, bool cold)
 {
 	int mt = migratetype, i;
 
@@ -1208,7 +1208,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * merge IO requests if the physical pages are ordered
 		 * properly.
 		 */
-		if (likely(cold == 0))
+		if (likely(!cold))
 			list_add(&page->lru, list);
 		else
 			list_add_tail(&page->lru, list);
@@ -1375,9 +1375,9 @@ void mark_free_pages(struct zone *zone)
 
 /*
  * Free a 0-order page
- * cold == 1 ? free a cold page : free a hot page
+ * cold == true ? free a cold page : free a hot page
  */
-void free_hot_cold_page(struct page *page, int cold)
+void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
@@ -1409,10 +1409,10 @@ void free_hot_cold_page(struct page *page, int cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (cold)
-		list_add_tail(&page->lru, &pcp->lists[migratetype]);
-	else
+	if (!cold)
 		list_add(&page->lru, &pcp->lists[migratetype]);
+	else
+		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		unsigned long batch = ACCESS_ONCE(pcp->batch);
@@ -1427,7 +1427,7 @@ out:
 /*
  * Free a list of 0-order pages
  */
-void free_hot_cold_page_list(struct list_head *list, int cold)
+void free_hot_cold_page_list(struct list_head *list, bool cold)
 {
 	struct page *page, *next;
 
@@ -1544,7 +1544,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 {
 	unsigned long flags;
 	struct page *page;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 
 again:
 	if (likely(order == 0)) {
@@ -2849,7 +2849,7 @@ void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
 		if (order == 0)
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
 	}
diff --git a/mm/swap.c b/mm/swap.c
index 9ce43ba..f2228b7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,7 +67,7 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
-	free_hot_cold_page(page, 0);
+	free_hot_cold_page(page, false);
 }
 
 static void __put_compound_page(struct page *page)
@@ -813,7 +813,7 @@ void lru_add_drain_all(void)
  * grabbed the page via the LRU.  If it did, give up: shrink_inactive_list()
  * will free it.
  */
-void release_pages(struct page **pages, int nr, int cold)
+void release_pages(struct page **pages, int nr, bool cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e76ace3..2972eee 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -270,7 +270,7 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 
 		for (i = 0; i < todo; i++)
 			free_swap_cache(pagep[i]);
-		release_pages(pagep, todo, 0);
+		release_pages(pagep, todo, false);
 		pagep += todo;
 		nr -= todo;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f56c8d..8db1318 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1121,7 +1121,7 @@ keep:
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
-	free_hot_cold_page_list(&free_pages, 1);
+	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
@@ -1519,7 +1519,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&page_list, 1);
+	free_hot_cold_page_list(&page_list, true);
 
 	/*
 	 * If reclaim is isolating dirty pages under writeback, it implies
@@ -1740,7 +1740,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&l_hold, 1);
+	free_hot_cold_page_list(&l_hold, true);
 }
 
 #ifdef CONFIG_SWAP
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 12/17] mm: page_alloc: Convert hot/cold parameter and immediate callers to bool
@ 2014-05-01  8:44   ` Mel Gorman
  0 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

cold is a bool, make it one. Make the likely case the "if" part of the
block instead of the else as according to the optimisation manual this
is preferred.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 arch/tile/mm/homecache.c |  2 +-
 fs/fuse/dev.c            |  2 +-
 include/linux/gfp.h      |  4 ++--
 include/linux/pagemap.h  |  2 +-
 include/linux/swap.h     |  2 +-
 mm/page_alloc.c          | 20 ++++++++++----------
 mm/swap.c                |  4 ++--
 mm/swap_state.c          |  2 +-
 mm/vmscan.c              |  6 +++---
 9 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 004ba56..33294fd 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -417,7 +417,7 @@ void __homecache_free_pages(struct page *page, unsigned int order)
 	if (put_page_testzero(page)) {
 		homecache_change_page_home(page, order, PAGE_HOME_HASH);
 		if (order == 0) {
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		} else {
 			init_page_count(page);
 			__free_pages(page, order);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index aac71ce..098f97b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1614,7 +1614,7 @@ out_finish:
 
 static void fuse_retrieve_end(struct fuse_conn *fc, struct fuse_req *req)
 {
-	release_pages(req->pages, req->num_pages, 0);
+	release_pages(req->pages, req->num_pages, false);
 }
 
 static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 39b81dc..3824ac6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -369,8 +369,8 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
-extern void free_hot_cold_page(struct page *page, int cold);
-extern void free_hot_cold_page_list(struct list_head *list, int cold);
+extern void free_hot_cold_page(struct page *page, bool cold);
+extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
 extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 45598f1..9175f52 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -110,7 +110,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
-void release_pages(struct page **pages, int nr, int cold);
+void release_pages(struct page **pages, int nr, bool cold);
 
 /*
  * speculatively take a reference to a page.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3507115..da8a250 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -496,7 +496,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 #define free_page_and_swap_cache(page) \
 	page_cache_release(page)
 #define free_pages_and_swap_cache(pages, nr) \
-	release_pages((pages), (nr), 0);
+	release_pages((pages), (nr), false);
 
 static inline void show_swap_cache_info(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 087c178..94c5d06 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1189,7 +1189,7 @@ retry_reserve:
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype, int cold)
+			int migratetype, bool cold)
 {
 	int mt = migratetype, i;
 
@@ -1208,7 +1208,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * merge IO requests if the physical pages are ordered
 		 * properly.
 		 */
-		if (likely(cold == 0))
+		if (likely(!cold))
 			list_add(&page->lru, list);
 		else
 			list_add_tail(&page->lru, list);
@@ -1375,9 +1375,9 @@ void mark_free_pages(struct zone *zone)
 
 /*
  * Free a 0-order page
- * cold == 1 ? free a cold page : free a hot page
+ * cold == true ? free a cold page : free a hot page
  */
-void free_hot_cold_page(struct page *page, int cold)
+void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
@@ -1409,10 +1409,10 @@ void free_hot_cold_page(struct page *page, int cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (cold)
-		list_add_tail(&page->lru, &pcp->lists[migratetype]);
-	else
+	if (!cold)
 		list_add(&page->lru, &pcp->lists[migratetype]);
+	else
+		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		unsigned long batch = ACCESS_ONCE(pcp->batch);
@@ -1427,7 +1427,7 @@ out:
 /*
  * Free a list of 0-order pages
  */
-void free_hot_cold_page_list(struct list_head *list, int cold)
+void free_hot_cold_page_list(struct list_head *list, bool cold)
 {
 	struct page *page, *next;
 
@@ -1544,7 +1544,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 {
 	unsigned long flags;
 	struct page *page;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 
 again:
 	if (likely(order == 0)) {
@@ -2849,7 +2849,7 @@ void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
 		if (order == 0)
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
 	}
diff --git a/mm/swap.c b/mm/swap.c
index 9ce43ba..f2228b7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,7 +67,7 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
-	free_hot_cold_page(page, 0);
+	free_hot_cold_page(page, false);
 }
 
 static void __put_compound_page(struct page *page)
@@ -813,7 +813,7 @@ void lru_add_drain_all(void)
  * grabbed the page via the LRU.  If it did, give up: shrink_inactive_list()
  * will free it.
  */
-void release_pages(struct page **pages, int nr, int cold)
+void release_pages(struct page **pages, int nr, bool cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e76ace3..2972eee 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -270,7 +270,7 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 
 		for (i = 0; i < todo; i++)
 			free_swap_cache(pagep[i]);
-		release_pages(pagep, todo, 0);
+		release_pages(pagep, todo, false);
 		pagep += todo;
 		nr -= todo;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f56c8d..8db1318 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1121,7 +1121,7 @@ keep:
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
-	free_hot_cold_page_list(&free_pages, 1);
+	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
@@ -1519,7 +1519,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&page_list, 1);
+	free_hot_cold_page_list(&page_list, true);
 
 	/*
 	 * If reclaim is isolating dirty pages under writeback, it implies
@@ -1740,7 +1740,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&l_hold, 1);
+	free_hot_cold_page_list(&l_hold, true);
 }
 
 #ifdef CONFIG_SWAP
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 13/17] mm: shmem: Avoid atomic operation during shmem_getpage_gfp
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

shmem_getpage_gfp uses an atomic operation to set the SwapBacked flag on
a newly allocated page before it has been added to the LRU or made
visible to anyone else, so there is nothing the update could race
against. The atomic operation is unnecessary; use the unlocked variant
instead.
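
As background, here is roughly what the page-flags macros expand to (an
illustrative sketch, not part of the patch). The unlocked variant avoids
the locked read-modify-write of the atomic setter and is only safe while
the newly allocated page is still private to the caller:

	/* generated by PAGEFLAG(SwapBacked, swapbacked) */
	static inline void SetPageSwapBacked(struct page *page)
	{
		set_bit(PG_swapbacked, &page->flags);	/* atomic RMW */
	}

	/* generated by __SETPAGEFLAG(SwapBacked, swapbacked) */
	static inline void __SetPageSwapBacked(struct page *page)
	{
		__set_bit(PG_swapbacked, &page->flags);	/* plain, non-atomic */
	}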

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page-flags.h | 1 +
 mm/shmem.c                 | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d1fe1a7..4d4b39a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -208,6 +208,7 @@ PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
 PAGEFLAG(SavePinned, savepinned);			/* Xen */
 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
+	__SETPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f70e02..f47fb38 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1132,7 +1132,7 @@ repeat:
 			goto decused;
 		}
 
-		SetPageSwapBacked(page);
+		__SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_charge_file(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 14/17] mm: Do not use atomic operations when releasing pages
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

By the time release_pages() frees a page there should be no references
to it any more and a parallel mark_page_accessed() should not be
reordered against the free. Use the non-locked variant to clear the
page's active bit.
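
For context, a condensed sketch of the surrounding release_pages() loop
(simplified from mm/swap.c, locking details omitted); the non-atomic
clear happens only after put_page_testzero() has dropped the last
reference, so no other CPU can still be updating the flags word:

	if (!put_page_testzero(page))
		continue;	/* somebody else still holds a reference */

	if (PageLRU(page)) {
		/* already non-atomic, protected by zone->lru_lock */
		__ClearPageLRU(page);
		del_page_from_lru_list(page, lruvec, page_off_lru(page));
	}

	/* refcount is zero: the flags word cannot change under us */
	__ClearPageActive(page);
	list_add(&page->lru, &pages_to_free);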

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/swap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index f2228b7..7a5bdd7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		}
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
-		ClearPageActive(page);
+		__ClearPageActive(page);
 
 		list_add(&page->lru, &pages_to_free);
 	}
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 15/17] mm: Do not use unnecessary atomic operations when adding pages to the LRU
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

When adding pages to the LRU we clear the active bit unconditionally. As
the page could be reachable from other paths we cannot use unlocked
operations without risking corruption from a parallel update such as
mark_page_accessed. This patch instead tests whether it is necessary to
clear the active flag before issuing the atomic operation. In the
unlikely event that this races with mark_page_accessed, the consequence
is simply that a page which would previously have been left on the
inactive list may now be promoted to the active list. This is a marginal
consequence.
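
To see why a plain __ClearPageActive() would not be safe here, consider
this illustrative interleaving (not kernel code): page->flags is a
single word shared by all flag bits, so a non-atomic read-modify-write
on one CPU can discard a concurrent atomic update to a different bit in
the same word:

	/* CPU A: plain clear                    CPU B: atomic set */
	old = page->flags;
					set_bit(PG_referenced, &page->flags);
	page->flags = old & ~(1UL << PG_active); /* CPU B's update is lost */

Testing PageActive() first and only then issuing the atomic
ClearPageActive() keeps the common case cheap without ever doing a plain
store to a shared flags word.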

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/swap.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index da8a250..395dcab 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -329,13 +329,15 @@ extern void add_page_to_unevictable_list(struct page *page);
  */
 static inline void lru_cache_add_anon(struct page *page)
 {
-	ClearPageActive(page);
+	if (PageActive(page))
+		ClearPageActive(page);
 	__lru_cache_add(page);
 }
 
 static inline void lru_cache_add_file(struct page *page)
 {
-	ClearPageActive(page);
+	if (PageActive(page))
+		ClearPageActive(page);
 	__lru_cache_add(page);
 }
 
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 16/17] mm: Non-atomically mark page accessed during page cache allocation where possible
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

The most obvious example of the problem being tackled is that
aops->write_begin may allocate a new page and make it visible only to
have mark_page_accessed called almost immediately afterwards. Once the
page is visible the atomic operations are necessary, which is noticeable
overhead when writing to an in-memory filesystem like tmpfs but should
also be noticeable with fast storage. The objective of the patch is to
initialise the accessed information with non-atomic operations before
the page becomes visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
be called before the page is visible and so can be done non-atomically.
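
A minimal sketch of the intended call pattern, condensed from the
pagecache_get_page() hunk below (variable names simplified):

	page = __page_cache_alloc(gfp_mask);
	if (!page)
		return NULL;

	/* not yet visible: the non-atomic __SetPageReferenced() is safe */
	init_page_accessed(page);

	/* from here on the page is published; atomic flag ops are needed */
	err = add_to_page_cache_lru(page, mapping, offset,
					gfp_mask & GFP_RECLAIM_MASK);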

The primary APIs of concern in this case are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core
helper pagecache_get_page() which takes a flags parameter that affects
its behaviour, such as whether the page should be marked accessed or
not. The old API is preserved but each call is now basically a thin
wrapper around this core function.
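
For example, find_or_create_page() becomes a one-line wrapper (taken
from the include/linux/pagemap.h hunk below):

	static inline struct page *find_or_create_page(struct address_space *mapping,
					pgoff_t offset, gfp_t gfp_mask)
	{
		return pagecache_get_page(mapping, offset,
						FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
						gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
	}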

Each of the filesystems is then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of
mark_page_accessed() has now changed, so in rare cases it is possible
for a page to reach the end of the LRU as PageReferenced whereas
previously it might have been repromoted. This is expected to be rare
but it is worth the filesystem people thinking about it in case they see
a problem with the timing change. It is also the case that some
filesystems may now mark pages accessed that they previously did not,
but it makes sense for filesystems to behave consistently in this
regard.
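
As a hypothetical illustration of the kind of filesystem change involved
(example_write_begin and its body are made up for illustration and not
taken from any filesystem in this series): grab_cache_page_write_begin()
now passes FGP_ACCESSED, so an explicit mark_page_accessed() in the
caller becomes redundant and can be dropped:

	static int example_write_begin(struct file *file,
			struct address_space *mapping, loff_t pos,
			unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		struct page *page;

		page = grab_cache_page_write_begin(mapping,
					pos >> PAGE_CACHE_SHIFT, flags);
		if (!page)
			return -ENOMEM;

		/* mark_page_accessed(page) is no longer needed here */
		*pagep = page;
		return 0;
	}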

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th of physical memory to avoid dirty page balancing. In the
async case it is possible that the workload completes without even
hitting the disk and will have variable results, but it highlights the
impact of mark_page_accessed for async IO. The sync results are expected
to be more stable. The exception is tmpfs where the normal case is for
the "IO" to never hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO;
only wall times are shown for async as the granularity reported by dd
and the variability make it unsuitable for comparison. As the async
results were variable due to writeback timings, I'm only reporting the
maximum figures. The sync results were stable enough to make the mean
and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.6800 (  0.00%)     12.9200 (  5.56%)
tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8700 (  0.00%)     12.8200 (  0.39%)
ext4    Max      elapsed     13.5400 (  0.00%)     13.4000 (  1.03%)
xfs     Max      elapsed      1.9900 (  0.00%)      2.0000 ( -0.50%)

In most cases a respectable gain is shown. xfs is an exception but the
differences there are marginal. xfs was not directly using
mark_page_accessed but now uses it indirectly via
grab_cache_page_write_begin, which means that XFS now behaves
consistently with filesystems that use block_write_begin() with respect
to page activation.

        samples percentage
ext3     102029     1.1533  vmlinux-3.15.0-rc3-vanilla		mark_page_accessed
ext3      22676     0.2409  vmlinux-3.15.0-rc3-accessed-v2r  	mark_page_accessed
ext3       3560     0.0378  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
ext4      61524     0.8354  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
ext4       2177     0.0285  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
ext4       2025     0.0265  vmlinux-3.15.0-rc3-accessed-v2r     mark_page_accessed
xfs       56976     1.5582  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
xfs        2133     0.0601  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
xfs         100     0.0028  vmlinux-3.15.0-rc3-accessed-v2r     mark_page_accessed
btrfs     10678     0.1379  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
btrfs      2069     0.0271  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
btrfs       609     0.0080  vmlinux-3.15.0-rc3-accessed-v2r     mark_page_accessed
tmpfs     58424     3.1887  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
tmpfs      1249     0.0693  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
tmpfs        96     0.0053  vmlinux-3.15.0-rc3-accessed-v2r      mark_page_accessed

In all cases there is a massive reduction in the number of cycles spent
in mark_page_accessed doing atomic operations.

sync dd
                               3.15.0-rc3            3.15.0-rc3
                                  vanilla           accessed-v2
ext3    Max    tput    115.0000 (  0.00%)    116.0000 (  0.87%)
ext3    Max elapsed     15.1600 (  0.00%)     15.0900 (  0.46%)
ext4    Max    tput    121.0000 (  0.00%)    121.0000 (  0.00%)
ext4    Max elapsed     14.5700 (  0.00%)     14.4900 (  0.55%)
tmpfs   Max    tput   5017.6000 (  0.00%)   5324.8000 (  6.12%) (granularity is poor)
tmpfs   Max elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max    tput    128.0000 (  0.00%)    128.0000 (  0.00%)
btrfs   Max elapsed     13.5700 (  0.00%)     13.5000 (  0.52%)
xfs     Max    tput    122.0000 (  0.00%)    122.0000 (  0.00%)
xfs     Max elapsed     14.3700 (  0.00%)     14.4500 ( -0.56%)

With the exception of tmpfs, for obvious reasons, the cost of
mark_page_accessed is mostly hidden and easily missed. Similar costs
show up in the profiles though, with similar savings when the patch is
applied. xfs again loses out here at the cost of behaving similarly to
other filesystems when aging pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/extent_io.c       |  11 +--
 fs/btrfs/file.c            |   5 +-
 fs/buffer.c                |   7 +-
 fs/ext4/mballoc.c          |  14 ++--
 fs/f2fs/checkpoint.c       |   3 -
 fs/f2fs/node.c             |   2 -
 fs/fuse/file.c             |   2 -
 fs/gfs2/aops.c             |   1 -
 fs/gfs2/meta_io.c          |   4 +-
 fs/ntfs/attrib.c           |   1 -
 fs/ntfs/file.c             |   1 -
 include/linux/page-flags.h |   1 +
 include/linux/pagemap.h    | 107 ++++++++++++++++++++++--
 include/linux/swap.h       |   1 +
 mm/filemap.c               | 202 +++++++++++++++++----------------------------
 mm/shmem.c                 |   6 +-
 mm/swap.c                  |  11 +++
 17 files changed, 217 insertions(+), 162 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3955e47..158833c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4510,7 +4510,8 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
 	spin_unlock(&eb->refs_lock);
 }
 
-static void mark_extent_buffer_accessed(struct extent_buffer *eb)
+static void mark_extent_buffer_accessed(struct extent_buffer *eb,
+		struct page *accessed)
 {
 	unsigned long num_pages, i;
 
@@ -4519,7 +4520,8 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb)
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		mark_page_accessed(p);
+		if (p != accessed)
+			mark_page_accessed(p);
 	}
 }
 
@@ -4533,7 +4535,7 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 			       start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
+		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}
 	rcu_read_unlock();
@@ -4581,7 +4583,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 				spin_unlock(&mapping->private_lock);
 				unlock_page(p);
 				page_cache_release(p);
-				mark_extent_buffer_accessed(exists);
+				mark_extent_buffer_accessed(exists, p);
 				goto free_eb;
 			}
 
@@ -4596,7 +4598,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		attach_extent_buffer_page(eb, p);
 		spin_unlock(&mapping->private_lock);
 		WARN_ON(PageDirty(p));
-		mark_page_accessed(p);
 		eb->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ae6af07..74272a3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -470,11 +470,12 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 	for (i = 0; i < num_pages; i++) {
 		/* page checked is some magic around finding pages that
 		 * have been modified without going through btrfs_set_page_dirty
-		 * clear it here
+		 * clear it here. There should be no need to mark the pages
+		 * accessed as prepare_pages should have marked them accessed
+		 * in prepare_pages via find_or_create_page()
 		 */
 		ClearPageChecked(pages[i]);
 		unlock_page(pages[i]);
-		mark_page_accessed(pages[i]);
 		page_cache_release(pages[i]);
 	}
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index 9ddb9fc..83627b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -227,7 +227,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 	int all_mapped = 1;
 
 	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
-	page = find_get_page(bd_mapping, index);
+	page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
 	if (!page)
 		goto out;
 
@@ -1366,12 +1366,13 @@ __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
 	struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
 
 	if (bh == NULL) {
+		/* __find_get_block_slow will mark the page accessed */
 		bh = __find_get_block_slow(bdev, block);
 		if (bh)
 			bh_lru_install(bh);
-	}
-	if (bh)
+	} else
 		touch_buffer(bh);
+
 	return bh;
 }
 EXPORT_SYMBOL(__find_get_block);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c8238a2..afe8a13 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1044,6 +1044,8 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 	 * allocating. If we are looking at the buddy cache we would
 	 * have taken a reference using ext4_mb_load_buddy and that
 	 * would have pinned buddy page to page cache.
+	 * The call to ext4_mb_get_buddy_page_lock will mark the
+	 * page accessed.
 	 */
 	ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b);
 	if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) {
@@ -1062,7 +1064,6 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 		ret = -EIO;
 		goto err;
 	}
-	mark_page_accessed(page);
 
 	if (e4b.bd_buddy_page == NULL) {
 		/*
@@ -1082,7 +1083,6 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 		ret = -EIO;
 		goto err;
 	}
-	mark_page_accessed(page);
 err:
 	ext4_mb_put_buddy_page_lock(&e4b);
 	return ret;
@@ -1141,7 +1141,7 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 
 	/* we could use find_or_create_page(), but it locks page
 	 * what we'd like to avoid in fast path ... */
-	page = find_get_page(inode->i_mapping, pnum);
+	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			/*
@@ -1176,15 +1176,16 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 		ret = -EIO;
 		goto err;
 	}
+
+	/* Pages marked accessed already */
 	e4b->bd_bitmap_page = page;
 	e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
-	mark_page_accessed(page);
 
 	block++;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
 
-	page = find_get_page(inode->i_mapping, pnum);
+	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			page_cache_release(page);
@@ -1209,9 +1210,10 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 		ret = -EIO;
 		goto err;
 	}
+
+	/* Pages marked accessed already */
 	e4b->bd_buddy_page = page;
 	e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
-	mark_page_accessed(page);
 
 	BUG_ON(e4b->bd_bitmap_page == NULL);
 	BUG_ON(e4b->bd_buddy_page == NULL);
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 4aa521a..c405b8f 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -69,7 +69,6 @@ repeat:
 		goto repeat;
 	}
 out:
-	mark_page_accessed(page);
 	return page;
 }
 
@@ -137,13 +136,11 @@ int ra_meta_pages(struct f2fs_sb_info *sbi, int start, int nrpages, int type)
 		if (!page)
 			continue;
 		if (PageUptodate(page)) {
-			mark_page_accessed(page);
 			f2fs_put_page(page, 1);
 			continue;
 		}
 
 		f2fs_submit_page_mbio(sbi, page, blk_addr, &fio);
-		mark_page_accessed(page);
 		f2fs_put_page(page, 0);
 	}
 out:
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index a161e95..57caa6e 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -967,7 +967,6 @@ repeat:
 		goto repeat;
 	}
 got_it:
-	mark_page_accessed(page);
 	return page;
 }
 
@@ -1022,7 +1021,6 @@ page_hit:
 		f2fs_put_page(page, 1);
 		return ERR_PTR(-EIO);
 	}
-	mark_page_accessed(page);
 	return page;
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 13f8bde..85a3359 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1089,8 +1089,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
 		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
-
 		if (!tmp) {
 			unlock_page(page);
 			page_cache_release(page);
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ce62dca..3c1ab7b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -577,7 +577,6 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 		p = kmap_atomic(page);
 		memcpy(buf + copied, p + offset, amt);
 		kunmap_atomic(p);
-		mark_page_accessed(page);
 		page_cache_release(page);
 		copied += amt;
 		index++;
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 2cf09b6..b984a6e 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -136,7 +136,8 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
 			yield();
 		}
 	} else {
-		page = find_lock_page(mapping, index);
+		page = find_get_page_flags(mapping, index,
+						FGP_LOCK|FGP_ACCESSED);
 		if (!page)
 			return NULL;
 	}
@@ -153,7 +154,6 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
 		map_bh(bh, sdp->sd_vfs, blkno);
 
 	unlock_page(page);
-	mark_page_accessed(page);
 	page_cache_release(page);
 
 	return bh;
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index a27e3fe..250ed5b 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -1748,7 +1748,6 @@ int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size)
 	if (page) {
 		set_page_dirty(page);
 		unlock_page(page);
-		mark_page_accessed(page);
 		page_cache_release(page);
 	}
 	ntfs_debug("Done.");
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index db9bd8a..86ddab9 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2060,7 +2060,6 @@ static ssize_t ntfs_file_buffered_write(struct kiocb *iocb,
 		}
 		do {
 			unlock_page(pages[--do_pages]);
-			mark_page_accessed(pages[do_pages]);
 			page_cache_release(pages[do_pages]);
 		} while (do_pages);
 		if (unlikely(status))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4d4b39a..2093eb7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,6 +198,7 @@ struct page;	/* forward declaration */
 TESTPAGEFLAG(Locked, locked)
 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
+	__SETPAGEFLAG(Referenced, referenced)
 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9175f52..e5ffaa0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -259,12 +259,109 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
+#define FGP_ACCESSED		0x00000001
+#define FGP_LOCK		0x00000002
+#define FGP_CREAT		0x00000004
+#define FGP_WRITE		0x00000008
+#define FGP_NOFS		0x00000010
+#define FGP_NOWAIT		0x00000020
+
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+		int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
+
+/**
+ * find_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * Otherwise, %NULL is returned.
+ */
+static inline struct page *find_get_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	return pagecache_get_page(mapping, offset, 0, 0, 0);
+}
+
+static inline struct page *find_get_page_flags(struct address_space *mapping,
+					pgoff_t offset, int fgp_flags)
+{
+	return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
+}
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * pagecache_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * Otherwise, %NULL is returned.
+ *
+ * find_lock_page() may sleep.
+ */
+static inline struct page *find_lock_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
+}
+
+/**
+ * find_or_create_page - locate or add a pagecache page
+ * @mapping: the page's address_space
+ * @index: the page's index into the mapping
+ * @gfp_mask: page allocation mode
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * If the page is not present, a new page is allocated using @gfp_mask
+ * and added to the page cache and the VM's LRU list.  The page is
+ * returned locked and with an increased refcount.
+ *
+ * On memory exhaustion, %NULL is returned.
+ *
+ * find_or_create_page() may sleep, even if @gfp_flags specifies an
+ * atomic allocation!
+ */
+static inline struct page *find_or_create_page(struct address_space *mapping,
+					pgoff_t offset, gfp_t gfp_mask)
+{
+	return pagecache_get_page(mapping, offset,
+					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
+					gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
+}
+
+/**
+ * grab_cache_page_nowait - returns locked page at given index in given cache
+ * @mapping: target address_space
+ * @index: the page index
+ *
+ * Same as grab_cache_page(), but do not wait if the page is unavailable.
+ * This is intended for speculative data generators, where the data can
+ * be regenerated if the page couldn't be grabbed.  This routine should
+ * be safe to call while holding the lock for another page.
+ *
+ * Clear __GFP_FS when allocating the page to avoid recursion into the fs
+ * and deadlock against the caller's locked page.
+ */
+static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
+				pgoff_t index)
+{
+	return pagecache_get_page(mapping, index,
+			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
+			mapping_gfp_mask(mapping),
+			GFP_NOFS);
+}
+
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
-struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
-				 gfp_t gfp_mask);
 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
 			  unsigned int nr_entries, struct page **entries,
 			  pgoff_t *indices);
@@ -287,8 +384,6 @@ static inline struct page *grab_cache_page(struct address_space *mapping,
 	return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
 }
 
-extern struct page * grab_cache_page_nowait(struct address_space *mapping,
-				pgoff_t index);
 extern struct page * read_cache_page(struct address_space *mapping,
 				pgoff_t index, filler_t *filler, void *data);
 extern struct page * read_cache_page_gfp(struct address_space *mapping,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 395dcab..b570ad5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -314,6 +314,7 @@ extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
+extern void init_page_accessed(struct page *page);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index 5020b28..c60ed0f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -955,26 +955,6 @@ out:
 EXPORT_SYMBOL(find_get_entry);
 
 /**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned with an increased refcount.
- *
- * Otherwise, %NULL is returned.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_get_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_get_page);
-
-/**
  * find_lock_entry - locate, pin and lock a page cache entry
  * @mapping: the address_space to search
  * @offset: the page cache index
@@ -1011,66 +991,84 @@ repeat:
 EXPORT_SYMBOL(find_lock_entry);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * pagecache_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
+ * @fgp_flags: PCG flags
+ * @gfp_mask: gfp mask to use if a page is to be allocated
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
- *
- * Otherwise, %NULL is returned.
- *
- * find_lock_page() may sleep.
- */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_lock_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_lock_page);
-
-/**
- * find_or_create_page - locate or add a pagecache page
- * @mapping: the page's address_space
- * @index: the page's index into the mapping
- * @gfp_mask: page allocation mode
+ * Looks up the page cache slot at @mapping & @offset.
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * PCG flags modify how the page is returned
  *
- * If the page is not present, a new page is allocated using @gfp_mask
- * and added to the page cache and the VM's LRU list.  The page is
- * returned locked and with an increased refcount.
+ * FGP_ACCESSED: the page will be marked accessed
+ * FGP_LOCK: Page is returned locked
+ * FGP_CREAT: If page is not present then a new page is allocated using
+ *		@gfp_mask and added to the page cache and the VM's LRU
+ *		list. The page is returned locked and with an increased
+ *		refcount. Otherwise, %NULL is returned.
  *
- * On memory exhaustion, %NULL is returned.
+ * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
+ * if the GFP flags specified for FGP_CREAT are atomic.
  *
- * find_or_create_page() may sleep, even if @gfp_flags specifies an
- * atomic allocation!
+ * If there is a page cache page, it is returned with an increased refcount.
  */
-struct page *find_or_create_page(struct address_space *mapping,
-		pgoff_t index, gfp_t gfp_mask)
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
 {
 	struct page *page;
-	int err;
+
 repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = find_get_entry(mapping, offset);
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	if (!page)
+		goto no_page;
+
+	if (fgp_flags & FGP_LOCK) {
+		if (fgp_flags & FGP_NOWAIT) {
+			if (!trylock_page(page)) {
+				page_cache_release(page);
+				return NULL;
+			}
+		} else {
+			lock_page(page);
+		}
+
+		/* Has the page been truncated? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+		VM_BUG_ON_PAGE(page->index != offset, page);
+	}
+
+	if (page && (fgp_flags & FGP_ACCESSED))
+		mark_page_accessed(page);
+
+no_page:
+	if (!page && (fgp_flags & FGP_CREAT)) {
+		int err;
+		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+			cache_gfp_mask |= __GFP_WRITE;
+		if (fgp_flags & FGP_NOFS) {
+			cache_gfp_mask &= ~__GFP_FS;
+			radix_gfp_mask &= ~__GFP_FS;
+		}
+
+		page = __page_cache_alloc(cache_gfp_mask);
 		if (!page)
 			return NULL;
-		/*
-		 * We want a regular kernel memory (not highmem or DMA etc)
-		 * allocation for the radix tree nodes, but we need to honour
-		 * the context-specific requirements the caller has asked for.
-		 * GFP_RECLAIM_MASK collects those requirements.
-		 */
-		err = add_to_page_cache_lru(page, mapping, index,
-			(gfp_mask & GFP_RECLAIM_MASK));
+
+		if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
+			fgp_flags |= FGP_LOCK;
+
+		/* Init accessed so as to avoid an atomic mark_page_accessed later */
+		if (fgp_flags & FGP_ACCESSED)
+			init_page_accessed(page);
+
+		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -1078,9 +1076,10 @@ repeat:
 				goto repeat;
 		}
 	}
+
 	return page;
 }
-EXPORT_SYMBOL(find_or_create_page);
+EXPORT_SYMBOL(pagecache_get_page);
 
 /**
  * find_get_entries - gang pagecache lookup
@@ -1370,39 +1369,6 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
-/**
- * grab_cache_page_nowait - returns locked page at given index in given cache
- * @mapping: target address_space
- * @index: the page index
- *
- * Same as grab_cache_page(), but do not wait if the page is unavailable.
- * This is intended for speculative data generators, where the data can
- * be regenerated if the page couldn't be grabbed.  This routine should
- * be safe to call while holding the lock for another page.
- *
- * Clear __GFP_FS when allocating the page to avoid recursion into the fs
- * and deadlock against the caller's locked page.
- */
-struct page *
-grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
-{
-	struct page *page = find_get_page(mapping, index);
-
-	if (page) {
-		if (trylock_page(page))
-			return page;
-		page_cache_release(page);
-		return NULL;
-	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
-		page_cache_release(page);
-		page = NULL;
-	}
-	return page;
-}
-EXPORT_SYMBOL(grab_cache_page_nowait);
-
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
@@ -2372,7 +2338,6 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
 {
 	const struct address_space_operations *aops = mapping->a_ops;
 
-	mark_page_accessed(page);
 	return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 EXPORT_SYMBOL(pagecache_write_end);
@@ -2454,34 +2419,18 @@ EXPORT_SYMBOL(generic_file_direct_write);
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
-	int status;
-	gfp_t gfp_mask;
 	struct page *page;
-	gfp_t gfp_notmask = 0;
+	int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
 
-	gfp_mask = mapping_gfp_mask(mapping);
-	if (mapping_cap_account_dirty(mapping))
-		gfp_mask |= __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
-		gfp_notmask = __GFP_FS;
-repeat:
-	page = find_lock_page(mapping, index);
+		fgp_flags |= FGP_NOFS;
+
+	page = pagecache_get_page(mapping, index, fgp_flags,
+			mapping_gfp_mask(mapping),
+			GFP_KERNEL);
 	if (page)
-		goto found;
+		wait_for_stable_page(page);
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
-	if (!page)
-		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index,
-						GFP_KERNEL & ~gfp_notmask);
-	if (unlikely(status)) {
-		page_cache_release(page);
-		if (status == -EEXIST)
-			goto repeat;
-		return NULL;
-	}
-found:
-	wait_for_stable_page(page);
 	return page;
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
@@ -2530,7 +2479,7 @@ again:
 
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status))
+		if (unlikely(status < 0))
 			break;
 
 		if (mapping_writably_mapped(mapping))
@@ -2539,7 +2488,6 @@ again:
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))
diff --git a/mm/shmem.c b/mm/shmem.c
index f47fb38..700a4ad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1372,9 +1372,13 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
+	int ret;
 	struct inode *inode = mapping->host;
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	if (*pagep)
+		init_page_accessed(*pagep);
+	return ret;
 }
 
 static int
diff --git a/mm/swap.c b/mm/swap.c
index 7a5bdd7..77baa36 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -583,6 +583,17 @@ void mark_page_accessed(struct page *page)
 EXPORT_SYMBOL(mark_page_accessed);
 
 /*
+ * Used to mark_page_accessed(page) that is not visible yet and when it is
+ * still safe to use non-atomic ops
+ */
+void init_page_accessed(struct page *page)
+{
+	if (!PageReferenced(page))
+		__SetPageReferenced(page);
+}
+EXPORT_SYMBOL(init_page_accessed);
+
+/*
  * Queue the page for addition to the LRU via pagevec. The decision on whether
  * to add the page to the [in]active [file|anon] list is deferred until the
  * pagevec is drained. This gives a chance for the caller of __lru_cache_add()
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 16/17] mm: Non-atomically mark page accessed during page cache allocation where possible
@ 2014-05-01  8:44   ` Mel Gorman
  0 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

The most obvious example of the problem being tackled is that
aops->write_begin may allocate a new page and make it visible only to
have mark_page_accessed called almost immediately afterwards. Once the
page is visible the atomic operations are necessary, which is noticeable
overhead when writing to an in-memory filesystem like tmpfs but should
also be noticeable with fast storage. The objective of the patch is to
initialise the accessed information with non-atomic operations before
the page becomes visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
be called before the page is visible and so can be done non-atomically.

The primary APIs of concern in this case are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core
helper pagecache_get_page() which takes a flags parameter that affects
its behaviour, such as whether the page should be marked accessed or
not. The old API is preserved but each call is now basically a thin
wrapper around this core function.

Each of the filesystems is then updated to avoid calling
mark_page_accessed when it is known that the VM interfaces have already
done the job. There is a slight snag in that the timing of
mark_page_accessed() has now changed, so in rare cases it is possible
for a page to reach the end of the LRU as PageReferenced whereas
previously it might have been repromoted. This is expected to be rare
but it is worth the filesystem people thinking about it in case they see
a problem with the timing change. It is also the case that some
filesystems may now mark pages accessed that they previously did not,
but it makes sense for filesystems to behave consistently in this
regard.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th of physical memory to avoid dirty page balancing. In the
async case it is possible that the workload completes without even
hitting the disk and will have variable results, but it highlights the
impact of mark_page_accessed for async IO. The sync results are expected
to be more stable. The exception is tmpfs where the normal case is for
the "IO" to never hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO;
only wall times are shown for async as the granularity reported by dd
and the variability make it unsuitable for comparison. As the async
results were variable due to writeback timings, I'm only reporting the
maximum figures. The sync results were stable enough to make the mean
and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.6800 (  0.00%)     12.9200 (  5.56%)
tmpfs   Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8700 (  0.00%)     12.8200 (  0.39%)
ext4    Max      elapsed     13.5400 (  0.00%)     13.4000 (  1.03%)
xfs     Max      elapsed      1.9900 (  0.00%)      2.0000 ( -0.50%)

In most cases a respectable gain is shown. xfs is an exception but the
differences there are marginal. xfs was not directly using
mark_page_accessed but now uses it indirectly via
grab_cache_page_write_begin, which means that XFS now behaves
consistently with filesystems that use block_write_begin() with respect
to page activation.

        samples percentage
ext3     102029     1.1533  vmlinux-3.15.0-rc3-vanilla		mark_page_accessed
ext3      22676     0.2409  vmlinux-3.15.0-rc3-accessed-v2r  	mark_page_accessed
ext3       3560     0.0378  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
ext4      61524     0.8354  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
ext4       2177     0.0285  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
ext4       2025     0.0265  vmlinux-3.15.0-rc3-accessed-v2r     mark_page_accessed
xfs       56976     1.5582  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
xfs        2133     0.0601  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
xfs         100     0.0028  vmlinux-3.15.0-rc3-accessed-v2r     mark_page_accessed
btrfs     10678     0.1379  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
btrfs      2069     0.0271  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
btrfs       609     0.0080  vmlinux-3.15.0-rc3-accessed-v2r     mark_page_accessed
tmpfs     58424     3.1887  vmlinux-3.15.0-rc3-vanilla          mark_page_accessed
tmpfs      1249     0.0693  vmlinux-3.15.0-rc3-accessed-v2r     init_page_accessed
tmpfs        96     0.0053  vmlinux-3.15.0-rc3-accessed-v2r      mark_page_accessed

In all cases there is a massive reduction in the number of cycles spent
in mark_page_accessed doing atomic operations.

sync dd
                               3.15.0-rc3            3.15.0-rc3
                                  vanilla           accessed-v2
ext3    Max    tput    115.0000 (  0.00%)    116.0000 (  0.87%)
ext3    Max elapsed     15.1600 (  0.00%)     15.0900 (  0.46%)
ext4    Max    tput    121.0000 (  0.00%)    121.0000 (  0.00%)
ext4    Max elapsed     14.5700 (  0.00%)     14.4900 (  0.55%)
tmpfs   Max    tput   5017.6000 (  0.00%)   5324.8000 (  6.12%) (granularity is poor)
tmpfs   Max elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max    tput    128.0000 (  0.00%)    128.0000 (  0.00%)
btrfs   Max elapsed     13.5700 (  0.00%)     13.5000 (  0.52%)
xfs     Max    tput    122.0000 (  0.00%)    122.0000 (  0.00%)
xfs     Max elapsed     14.3700 (  0.00%)     14.4500 ( -0.56%)

With the exception of tmpfs, for obvious reasons, the cost of
mark_page_accessed is mostly hidden and easily missed. Similar costs
show up in the profiles though, with similar savings when the patch is
applied. xfs again loses out here at the cost of behaving similarly to
other filesystems when aging pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/extent_io.c       |  11 +--
 fs/btrfs/file.c            |   5 +-
 fs/buffer.c                |   7 +-
 fs/ext4/mballoc.c          |  14 ++--
 fs/f2fs/checkpoint.c       |   3 -
 fs/f2fs/node.c             |   2 -
 fs/fuse/file.c             |   2 -
 fs/gfs2/aops.c             |   1 -
 fs/gfs2/meta_io.c          |   4 +-
 fs/ntfs/attrib.c           |   1 -
 fs/ntfs/file.c             |   1 -
 include/linux/page-flags.h |   1 +
 include/linux/pagemap.h    | 107 ++++++++++++++++++++++--
 include/linux/swap.h       |   1 +
 mm/filemap.c               | 202 +++++++++++++++++----------------------------
 mm/shmem.c                 |   6 +-
 mm/swap.c                  |  11 +++
 17 files changed, 217 insertions(+), 162 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3955e47..158833c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4510,7 +4510,8 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
 	spin_unlock(&eb->refs_lock);
 }
 
-static void mark_extent_buffer_accessed(struct extent_buffer *eb)
+static void mark_extent_buffer_accessed(struct extent_buffer *eb,
+		struct page *accessed)
 {
 	unsigned long num_pages, i;
 
@@ -4519,7 +4520,8 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb)
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		mark_page_accessed(p);
+		if (p != accessed)
+			mark_page_accessed(p);
 	}
 }
 
@@ -4533,7 +4535,7 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 			       start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
+		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}
 	rcu_read_unlock();
@@ -4581,7 +4583,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 				spin_unlock(&mapping->private_lock);
 				unlock_page(p);
 				page_cache_release(p);
-				mark_extent_buffer_accessed(exists);
+				mark_extent_buffer_accessed(exists, p);
 				goto free_eb;
 			}
 
@@ -4596,7 +4598,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		attach_extent_buffer_page(eb, p);
 		spin_unlock(&mapping->private_lock);
 		WARN_ON(PageDirty(p));
-		mark_page_accessed(p);
 		eb->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ae6af07..74272a3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -470,11 +470,12 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 	for (i = 0; i < num_pages; i++) {
 		/* page checked is some magic around finding pages that
 		 * have been modified without going through btrfs_set_page_dirty
-		 * clear it here
+		 * clear it here. There should be no need to mark the pages
+		 * accessed as prepare_pages should have marked them accessed
+		 * in prepare_pages via find_or_create_page()
 		 */
 		ClearPageChecked(pages[i]);
 		unlock_page(pages[i]);
-		mark_page_accessed(pages[i]);
 		page_cache_release(pages[i]);
 	}
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index 9ddb9fc..83627b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -227,7 +227,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 	int all_mapped = 1;
 
 	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
-	page = find_get_page(bd_mapping, index);
+	page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
 	if (!page)
 		goto out;
 
@@ -1366,12 +1366,13 @@ __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
 	struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
 
 	if (bh == NULL) {
+		/* __find_get_block_slow will mark the page accessed */
 		bh = __find_get_block_slow(bdev, block);
 		if (bh)
 			bh_lru_install(bh);
-	}
-	if (bh)
+	} else
 		touch_buffer(bh);
+
 	return bh;
 }
 EXPORT_SYMBOL(__find_get_block);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c8238a2..afe8a13 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1044,6 +1044,8 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 	 * allocating. If we are looking at the buddy cache we would
 	 * have taken a reference using ext4_mb_load_buddy and that
 	 * would have pinned buddy page to page cache.
+	 * The call to ext4_mb_get_buddy_page_lock will mark the
+	 * page accessed.
 	 */
 	ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b);
 	if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) {
@@ -1062,7 +1064,6 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 		ret = -EIO;
 		goto err;
 	}
-	mark_page_accessed(page);
 
 	if (e4b.bd_buddy_page == NULL) {
 		/*
@@ -1082,7 +1083,6 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 		ret = -EIO;
 		goto err;
 	}
-	mark_page_accessed(page);
 err:
 	ext4_mb_put_buddy_page_lock(&e4b);
 	return ret;
@@ -1141,7 +1141,7 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 
 	/* we could use find_or_create_page(), but it locks page
 	 * what we'd like to avoid in fast path ... */
-	page = find_get_page(inode->i_mapping, pnum);
+	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			/*
@@ -1176,15 +1176,16 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 		ret = -EIO;
 		goto err;
 	}
+
+	/* Pages marked accessed already */
 	e4b->bd_bitmap_page = page;
 	e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
-	mark_page_accessed(page);
 
 	block++;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
 
-	page = find_get_page(inode->i_mapping, pnum);
+	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			page_cache_release(page);
@@ -1209,9 +1210,10 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 		ret = -EIO;
 		goto err;
 	}
+
+	/* Pages marked accessed already */
 	e4b->bd_buddy_page = page;
 	e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
-	mark_page_accessed(page);
 
 	BUG_ON(e4b->bd_bitmap_page == NULL);
 	BUG_ON(e4b->bd_buddy_page == NULL);
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 4aa521a..c405b8f 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -69,7 +69,6 @@ repeat:
 		goto repeat;
 	}
 out:
-	mark_page_accessed(page);
 	return page;
 }
 
@@ -137,13 +136,11 @@ int ra_meta_pages(struct f2fs_sb_info *sbi, int start, int nrpages, int type)
 		if (!page)
 			continue;
 		if (PageUptodate(page)) {
-			mark_page_accessed(page);
 			f2fs_put_page(page, 1);
 			continue;
 		}
 
 		f2fs_submit_page_mbio(sbi, page, blk_addr, &fio);
-		mark_page_accessed(page);
 		f2fs_put_page(page, 0);
 	}
 out:
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index a161e95..57caa6e 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -967,7 +967,6 @@ repeat:
 		goto repeat;
 	}
 got_it:
-	mark_page_accessed(page);
 	return page;
 }
 
@@ -1022,7 +1021,6 @@ page_hit:
 		f2fs_put_page(page, 1);
 		return ERR_PTR(-EIO);
 	}
-	mark_page_accessed(page);
 	return page;
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 13f8bde..85a3359 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1089,8 +1089,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
 		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
-
 		if (!tmp) {
 			unlock_page(page);
 			page_cache_release(page);
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ce62dca..3c1ab7b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -577,7 +577,6 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 		p = kmap_atomic(page);
 		memcpy(buf + copied, p + offset, amt);
 		kunmap_atomic(p);
-		mark_page_accessed(page);
 		page_cache_release(page);
 		copied += amt;
 		index++;
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 2cf09b6..b984a6e 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -136,7 +136,8 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
 			yield();
 		}
 	} else {
-		page = find_lock_page(mapping, index);
+		page = find_get_page_flags(mapping, index,
+						FGP_LOCK|FGP_ACCESSED);
 		if (!page)
 			return NULL;
 	}
@@ -153,7 +154,6 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
 		map_bh(bh, sdp->sd_vfs, blkno);
 
 	unlock_page(page);
-	mark_page_accessed(page);
 	page_cache_release(page);
 
 	return bh;
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index a27e3fe..250ed5b 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -1748,7 +1748,6 @@ int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size)
 	if (page) {
 		set_page_dirty(page);
 		unlock_page(page);
-		mark_page_accessed(page);
 		page_cache_release(page);
 	}
 	ntfs_debug("Done.");
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index db9bd8a..86ddab9 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2060,7 +2060,6 @@ static ssize_t ntfs_file_buffered_write(struct kiocb *iocb,
 		}
 		do {
 			unlock_page(pages[--do_pages]);
-			mark_page_accessed(pages[do_pages]);
 			page_cache_release(pages[do_pages]);
 		} while (do_pages);
 		if (unlikely(status))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4d4b39a..2093eb7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,6 +198,7 @@ struct page;	/* forward declaration */
 TESTPAGEFLAG(Locked, locked)
 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
+	__SETPAGEFLAG(Referenced, referenced)
 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9175f52..e5ffaa0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -259,12 +259,109 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
+#define FGP_ACCESSED		0x00000001
+#define FGP_LOCK		0x00000002
+#define FGP_CREAT		0x00000004
+#define FGP_WRITE		0x00000008
+#define FGP_NOFS		0x00000010
+#define FGP_NOWAIT		0x00000020
+
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+		int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
+
+/**
+ * find_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * Otherwise, %NULL is returned.
+ */
+static inline struct page *find_get_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	return pagecache_get_page(mapping, offset, 0, 0, 0);
+}
+
+static inline struct page *find_get_page_flags(struct address_space *mapping,
+					pgoff_t offset, int fgp_flags)
+{
+	return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
+}
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * Otherwise, %NULL is returned.
+ *
+ * find_lock_page() may sleep.
+ */
+static inline struct page *find_lock_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
+}
+
+/**
+ * find_or_create_page - locate or add a pagecache page
+ * @mapping: the page's address_space
+ * @index: the page's index into the mapping
+ * @gfp_mask: page allocation mode
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * If the page is not present, a new page is allocated using @gfp_mask
+ * and added to the page cache and the VM's LRU list.  The page is
+ * returned locked and with an increased refcount.
+ *
+ * On memory exhaustion, %NULL is returned.
+ *
+ * find_or_create_page() may sleep, even if @gfp_flags specifies an
+ * atomic allocation!
+ */
+static inline struct page *find_or_create_page(struct address_space *mapping,
+					pgoff_t offset, gfp_t gfp_mask)
+{
+	return pagecache_get_page(mapping, offset,
+					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
+					gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
+}
+
+/**
+ * grab_cache_page_nowait - returns locked page at given index in given cache
+ * @mapping: target address_space
+ * @index: the page index
+ *
+ * Same as grab_cache_page(), but do not wait if the page is unavailable.
+ * This is intended for speculative data generators, where the data can
+ * be regenerated if the page couldn't be grabbed.  This routine should
+ * be safe to call while holding the lock for another page.
+ *
+ * Clear __GFP_FS when allocating the page to avoid recursion into the fs
+ * and deadlock against the caller's locked page.
+ */
+static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
+				pgoff_t index)
+{
+	return pagecache_get_page(mapping, index,
+			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
+			mapping_gfp_mask(mapping),
+			GFP_NOFS);
+}
+
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
-struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
-				 gfp_t gfp_mask);
 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
 			  unsigned int nr_entries, struct page **entries,
 			  pgoff_t *indices);
@@ -287,8 +384,6 @@ static inline struct page *grab_cache_page(struct address_space *mapping,
 	return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
 }
 
-extern struct page * grab_cache_page_nowait(struct address_space *mapping,
-				pgoff_t index);
 extern struct page * read_cache_page(struct address_space *mapping,
 				pgoff_t index, filler_t *filler, void *data);
 extern struct page * read_cache_page_gfp(struct address_space *mapping,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 395dcab..b570ad5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -314,6 +314,7 @@ extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
+extern void init_page_accessed(struct page *page);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index 5020b28..c60ed0f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -955,26 +955,6 @@ out:
 EXPORT_SYMBOL(find_get_entry);
 
 /**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned with an increased refcount.
- *
- * Otherwise, %NULL is returned.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_get_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_get_page);
-
-/**
  * find_lock_entry - locate, pin and lock a page cache entry
  * @mapping: the address_space to search
  * @offset: the page cache index
@@ -1011,66 +991,84 @@ repeat:
 EXPORT_SYMBOL(find_lock_entry);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * pagecache_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
+ * @fgp_flags: FGP flags
+ * @cache_gfp_mask: gfp mask to use if a page has to be allocated
+ * @radix_gfp_mask: gfp mask to use for the radix tree node allocation
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
- *
- * Otherwise, %NULL is returned.
- *
- * find_lock_page() may sleep.
- */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_lock_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_lock_page);
-
-/**
- * find_or_create_page - locate or add a pagecache page
- * @mapping: the page's address_space
- * @index: the page's index into the mapping
- * @gfp_mask: page allocation mode
+ * Looks up the page cache slot at @mapping & @offset.
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * FGP flags modify how the page is returned
  *
- * If the page is not present, a new page is allocated using @gfp_mask
- * and added to the page cache and the VM's LRU list.  The page is
- * returned locked and with an increased refcount.
+ * FGP_ACCESSED: the page will be marked accessed
+ * FGP_LOCK: the page is returned locked
+ * FGP_CREAT: If page is not present then a new page is allocated using
+ *		@gfp_mask and added to the page cache and the VM's LRU
+ *		list. The page is returned locked and with an increased
+ *		refcount. Otherwise, %NULL is returned.
  *
- * On memory exhaustion, %NULL is returned.
+ * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
+ * if the GFP flags specified for FGP_CREAT are atomic.
  *
- * find_or_create_page() may sleep, even if @gfp_flags specifies an
- * atomic allocation!
+ * If there is a page cache page, it is returned with an increased refcount.
  */
-struct page *find_or_create_page(struct address_space *mapping,
-		pgoff_t index, gfp_t gfp_mask)
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
 {
 	struct page *page;
-	int err;
+
 repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = find_get_entry(mapping, offset);
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	if (!page)
+		goto no_page;
+
+	if (fgp_flags & FGP_LOCK) {
+		if (fgp_flags & FGP_NOWAIT) {
+			if (!trylock_page(page)) {
+				page_cache_release(page);
+				return NULL;
+			}
+		} else {
+			lock_page(page);
+		}
+
+		/* Has the page been truncated? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+		VM_BUG_ON_PAGE(page->index != offset, page);
+	}
+
+	if (page && (fgp_flags & FGP_ACCESSED))
+		mark_page_accessed(page);
+
+no_page:
+	if (!page && (fgp_flags & FGP_CREAT)) {
+		int err;
+		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+			cache_gfp_mask |= __GFP_WRITE;
+		if (fgp_flags & FGP_NOFS) {
+			cache_gfp_mask &= ~__GFP_FS;
+			radix_gfp_mask &= ~__GFP_FS;
+		}
+
+		page = __page_cache_alloc(cache_gfp_mask);
 		if (!page)
 			return NULL;
-		/*
-		 * We want a regular kernel memory (not highmem or DMA etc)
-		 * allocation for the radix tree nodes, but we need to honour
-		 * the context-specific requirements the caller has asked for.
-		 * GFP_RECLAIM_MASK collects those requirements.
-		 */
-		err = add_to_page_cache_lru(page, mapping, index,
-			(gfp_mask & GFP_RECLAIM_MASK));
+
+		if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
+			fgp_flags |= FGP_LOCK;
+
+		/* Init accessed so we avoid an atomic mark_page_accessed later */
+		if (fgp_flags & FGP_ACCESSED)
+			init_page_accessed(page);
+
+		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -1078,9 +1076,10 @@ repeat:
 				goto repeat;
 		}
 	}
+
 	return page;
 }
-EXPORT_SYMBOL(find_or_create_page);
+EXPORT_SYMBOL(pagecache_get_page);
 
 /**
  * find_get_entries - gang pagecache lookup
@@ -1370,39 +1369,6 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
-/**
- * grab_cache_page_nowait - returns locked page at given index in given cache
- * @mapping: target address_space
- * @index: the page index
- *
- * Same as grab_cache_page(), but do not wait if the page is unavailable.
- * This is intended for speculative data generators, where the data can
- * be regenerated if the page couldn't be grabbed.  This routine should
- * be safe to call while holding the lock for another page.
- *
- * Clear __GFP_FS when allocating the page to avoid recursion into the fs
- * and deadlock against the caller's locked page.
- */
-struct page *
-grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
-{
-	struct page *page = find_get_page(mapping, index);
-
-	if (page) {
-		if (trylock_page(page))
-			return page;
-		page_cache_release(page);
-		return NULL;
-	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
-		page_cache_release(page);
-		page = NULL;
-	}
-	return page;
-}
-EXPORT_SYMBOL(grab_cache_page_nowait);
-
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
@@ -2372,7 +2338,6 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
 {
 	const struct address_space_operations *aops = mapping->a_ops;
 
-	mark_page_accessed(page);
 	return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 EXPORT_SYMBOL(pagecache_write_end);
@@ -2454,34 +2419,18 @@ EXPORT_SYMBOL(generic_file_direct_write);
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
-	int status;
-	gfp_t gfp_mask;
 	struct page *page;
-	gfp_t gfp_notmask = 0;
+	int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
 
-	gfp_mask = mapping_gfp_mask(mapping);
-	if (mapping_cap_account_dirty(mapping))
-		gfp_mask |= __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
-		gfp_notmask = __GFP_FS;
-repeat:
-	page = find_lock_page(mapping, index);
+		fgp_flags |= FGP_NOFS;
+
+	page = pagecache_get_page(mapping, index, fgp_flags,
+			mapping_gfp_mask(mapping),
+			GFP_KERNEL);
 	if (page)
-		goto found;
+		wait_for_stable_page(page);
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
-	if (!page)
-		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index,
-						GFP_KERNEL & ~gfp_notmask);
-	if (unlikely(status)) {
-		page_cache_release(page);
-		if (status == -EEXIST)
-			goto repeat;
-		return NULL;
-	}
-found:
-	wait_for_stable_page(page);
 	return page;
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
@@ -2530,7 +2479,7 @@ again:
 
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status))
+		if (unlikely(status < 0))
 			break;
 
 		if (mapping_writably_mapped(mapping))
@@ -2539,7 +2488,6 @@ again:
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))
diff --git a/mm/shmem.c b/mm/shmem.c
index f47fb38..700a4ad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1372,9 +1372,13 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
+	int ret;
 	struct inode *inode = mapping->host;
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	if (*pagep)
+		init_page_accessed(*pagep);
+	return ret;
 }
 
 static int
diff --git a/mm/swap.c b/mm/swap.c
index 7a5bdd7..77baa36 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -583,6 +583,17 @@ void mark_page_accessed(struct page *page)
 EXPORT_SYMBOL(mark_page_accessed);
 
 /*
+ * Used to mark a page accessed while it is not yet visible to other CPUs
+ * and it is therefore still safe to use non-atomic ops
+ */
+void init_page_accessed(struct page *page)
+{
+	if (!PageReferenced(page))
+		__SetPageReferenced(page);
+}
+EXPORT_SYMBOL(init_page_accessed);
+
+/*
  * Queue the page for addition to the LRU via pagevec. The decision on whether
  * to add the page to the [in]active [file|anon] list is deferred until the
  * pagevec is drained. This gives a chance for the caller of __lru_cache_add()
-- 
1.8.4.5
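
For illustration only, a minimal sketch of how a caller can use the new
pagecache_get_page() interface introduced above. The helper name is
hypothetical and the GFP arguments simply mirror what
grab_cache_page_write_begin() now passes; it is not part of the patch:

static struct page *example_grab_page(struct address_space *mapping,
				      pgoff_t index)
{
	/*
	 * Look the page up, lock it and mark it accessed in one call,
	 * allocating it if it is not already present. No separate
	 * mark_page_accessed() is needed afterwards.
	 */
	return pagecache_get_page(mapping, index,
				  FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
				  mapping_gfp_mask(mapping), GFP_KERNEL);
}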


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH 17/17] mm: filemap: Avoid unnecessary barriers and waitqueue lookup in unlock_page fastpath
  2014-05-01  8:44 ` Mel Gorman
@ 2014-05-01  8:44   ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01  8:44 UTC (permalink / raw)
  To: Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Mel Gorman, Linux Kernel

From: Nick Piggin <npiggin@suse.de>

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal that there are processes waiting on PG_locked and uses it to
avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but the
primary motivation for this patch is large machines so I do not think that
is a concern.
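
In outline, the unlock fastpath then looks as follows. This is a
simplified sketch of the change below; the real code factors the wakeup
into a __wake_page_waiters() helper:

void unlock_page(struct page *page)
{
	clear_bit_unlock(PG_locked, &page->flags);

	/* Only pay for the barrier and waitqueue hash lookup if needed */
	if (unlikely(PageWaiters(page))) {
		ClearPageWaiters(page);
		smp_mb__after_clear_bit();
		wake_up_page(page, PG_locked);
	}
}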

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of
the file is 1/10th physical memory to avoid dirty page balancing. In the
async case it is possible that the workload completes without even
hitting the disk and will have variable results, but it highlights the
impact of mark_page_accessed for async IO. The sync results are expected
to be more stable. The exception is tmpfs where the normal case is for
the "IO" to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO; only
wall times are shown for async as the granularity reported by dd and the
variability make it unsuitable for comparison. As async results were
variable due to writeback timings, I'm only reporting the maximum figures.
The sync results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The
kernels being compared are "accessed-v2", which is the patch series up
to this patch, whereas lockpage-v2 includes this patch.

async dd
                              3.15.0-rc3            3.15.0-rc3
                             accessed-v2           lockpage-v2
ext3   Max elapsed     12.9200 (  0.00%)     12.6700 (  1.93%)
ext4   Max elapsed     13.4000 (  0.00%)     13.3800 (  0.15%)
tmpfs  Max elapsed      0.4900 (  0.00%)      0.4800 (  2.04%)
btrfs  Max elapsed     12.8200 (  0.00%)     12.8200 (  0.00%)
xfs    Max elapsed      2.0000 (  0.00%)      2.1100 ( -5.50%)

By and large it was an improvement. xfs was a shame but FWIW in this
case the stddev for xfs is quite high and this result is well within
the noise. For clarity, here is the full set of xfs results

                            3.15.0-rc3            3.15.0-rc3
                        accessed-v2          lockpage-v2
Min      elapsed      0.5700 (  0.00%)      0.5400 (  5.26%)
Mean     elapsed      1.1157 (  0.00%)      1.1460 ( -2.72%)
TrimMean elapsed      1.1386 (  0.00%)      1.1757 ( -3.26%)
Stddev   elapsed      0.3653 (  0.00%)      0.4202 (-15.02%)
Max      elapsed      2.0000 (  0.00%)      2.1100 ( -5.50%)

The mean figures are well within the stddev. Still not a very happy
result but not enough to get upset about either.

     samples percentage
ext3   62312     0.6586  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
ext3   46530     0.4918  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
ext3    6447     0.0915  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
ext3   48619     0.6900  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page
ext4  112692    1.5815   vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
ext4   80699     1.1325  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
ext4   11461     0.1587  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
ext4  127146     1.7605  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page
tmpfs  17599     1.4799  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
tmpfs  13838     1.1636  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
tmpfs      4    2.3e-04  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
tmpfs  29061     1.6878  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page
btrfs  6762      0.0883  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
btrfs  72237     0.9428  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page
btrfs  63208     0.8140  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
btrfs  56963     0.7335  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
xfs    32350     0.9279  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
xfs    25115     0.7204  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
xfs     1981     0.0718  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
xfs    31085     1.1269  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page

In all cases note the large reduction in the time spent in page_waitqueue
as the page flag allows the cost to be avoided. In most cases, the time
spent in unlock_page is also decreased.

sync dd

ext3   Max    tput    116.0000 (  0.00%)    115.0000 ( -0.86%)
ext3   Max elapsed     15.3100 (  0.00%)     15.2600 (  0.33%)
ext4   Max    tput    120.0000 (  0.00%)    123.0000 (  2.50%)
ext4   Max elapsed     14.7300 (  0.00%)     14.7300 (  0.00%)
tmpfs  Max    tput   5324.8000 (  0.00%)   5324.8000 (  0.00%)
tmpfs  Max elapsed      0.4900 (  0.00%)      0.4800 (  2.04%)
btrfs  Max    tput    128.0000 (  0.00%)    128.0000 (  0.00%)
btrfs  Max elapsed     13.5000 (  0.00%)     13.6200 ( -0.89%)
xfs    Max    tput    122.0000 (  0.00%)    123.0000 (  0.82%)
xfs    Max elapsed     14.4500 (  0.00%)     14.6500 ( -1.38%)

Not a universal win in terms of headline performance, but system CPU usage
is reduced and the profiles do show that less time is spent looking up
waitqueues, so how much this benefits will depend on the machine used and
the exact workload.

The Intel vm-scalability tests tell a similar story. The ones measured here
are broadly based on dd of files 10 times the size of memory with one dd per
CPU in the system.

                                               3.15.0-rc3            3.15.0-rc3
                                              accessed-v2           lockpage-v2
ext3   lru-file-readonce    elapsed      3.7100 (  0.00%)      3.5500 (  4.31%)
ext3   lru-file-readtwice   elapsed      6.0000 (  0.00%)      6.1300 ( -2.17%)
ext3   lru-file-ddspread    elapsed      8.7800 (  0.00%)      8.4700 (  3.53%)
ext4   lru-file-readonce    elapsed      3.6700 (  0.00%)      3.5700 (  2.72%)
ext4   lru-file-readtwice   elapsed      6.5200 (  0.00%)      6.1600 (  5.52%)
ext4   lru-file-ddspread    elapsed      9.2800 (  0.00%)      9.2400 (  0.43%)
btrfs  lru-file-readonce    elapsed      5.0200 (  0.00%)      4.9700 (  1.00%)
btrfs  lru-file-readtwice   elapsed      7.6100 (  0.00%)      7.5500 (  0.79%)
btrfs  lru-file-ddspread    elapsed     10.7900 (  0.00%)     10.7400 (  0.46%)
xfs    lru-file-readonce    elapsed      3.6700 (  0.00%)      3.6400 (  0.82%)
xfs    lru-file-readtwice   elapsed      5.9300 (  0.00%)      6.0100 ( -1.35%)
xfs    lru-file-ddspread    elapsed      9.0500 (  0.00%)      8.9700 (  0.88%)

In most cases the time to read the file is lowered. Unlike the previous test
there is no impact on mark_page_accessed as the pages are already resident for
this test and there is no opportunity to mark the pages accessed without using
atomic operations. Instead the profiles show a reduction in the time spent in
page_waitqueue. This is the profile data for lru-file-readonce only.

     samples percentage
ext3   13447     0.5236  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
ext3    9763     0.3801  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
ext3       3    1.2e-04  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
ext3   13840     0.5550  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page
ext4   15976     0.5951  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
ext4    9920     0.3695  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
ext4       5    2.0e-04  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
ext4   13963     0.5542  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page
btrfs  13447     0.3720  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
btrfs   8349     0.2310  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
btrfs      7    2.0e-04  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
btrfs  12583     0.3549  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page
xfs    13028     0.5234  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
xfs     9698     0.3896  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
xfs        5    2.0e-04  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
xfs    15269     0.6215  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page

The time spent in unlock_page is similar, as the lock bit still has to
be cleared, but the time spent in page_waitqueue is virtually eliminated.

This is similarly reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                       3.15.0-rc3            3.15.0-rc3
                      accessed-v2           lockpage-v2
Procs 107M     533.0000 (  0.00%)    539.0000 ( -1.13%)
Procs 214M    1093.0000 (  0.00%)   1045.0000 (  4.39%)
Procs 322M    1572.0000 (  0.00%)   1334.0000 ( 15.14%)
Procs 429M    2012.0000 (  0.00%)   1998.0000 (  0.70%)
Procs 536M    2517.0000 (  0.00%)   3052.0000 (-21.26%)
Procs 644M    2916.0000 (  0.00%)   2856.0000 (  2.06%)
Procs 751M    3472.0000 (  0.00%)   3284.0000 (  5.41%)
Procs 859M    3810.0000 (  0.00%)   3854.0000 ( -1.15%)
Procs 966M    4411.0000 (  0.00%)   4296.0000 (  2.61%)
Procs 1073M   4923.0000 (  0.00%)   4791.0000 (  2.68%)
Procs 1181M   5237.0000 (  0.00%)   5169.0000 (  1.30%)
Procs 1288M   5587.0000 (  0.00%)   5494.0000 (  1.66%)
Procs 1395M   5771.0000 (  0.00%)   5790.0000 ( -0.33%)
Procs 1503M   6149.0000 (  0.00%)   5950.0000 (  3.24%)
Procs 1610M   6479.0000 (  0.00%)   6239.0000 (  3.70%)
Procs 1717M   6860.0000 (  0.00%)   6702.0000 (  2.30%)
Procs 1825M   7292.0000 (  0.00%)   7108.0000 (  2.52%)
Procs 1932M   7673.0000 (  0.00%)   7541.0000 (  1.72%)
Procs 2040M   8146.0000 (  0.00%)   7919.0000 (  2.79%)
Procs 2147M   8692.0000 (  0.00%)   8355.0000 (  3.88%)

         samples percentage
xfs        90552     1.4634  vmlinux-3.15.0-rc3-accessed-v2r33 page_waitqueue
xfs        71598     1.1571  vmlinux-3.15.0-rc3-accessed-v2r33 unlock_page
xfs         2773     0.0447  vmlinux-3.15.0-rc3-lockpage-v2r33 page_waitqueue
xfs       110399     1.7796  vmlinux-3.15.0-rc3-lockpage-v2r33 unlock_page

[jack@suse.cz: Fix add_page_wait_queue]
[mhocko@suse.cz: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@sgi.com: Do not update struct page unnecessarily]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h | 16 +++++++++
 include/linux/pagemap.h    |  6 ++--
 kernel/sched/wait.c        |  3 +-
 mm/filemap.c               | 90 ++++++++++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c            |  1 +
 5 files changed, 106 insertions(+), 10 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2093eb7..4c52d42 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,20 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fall back to the slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -506,6 +521,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e5ffaa0..2ec2d78 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -485,13 +485,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
  * Never use this directly!
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
+extern void __wait_on_page_locked(struct page *page);
 
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern int __wait_on_page_locked_killable(struct page *page);
 
 static inline int wait_on_page_locked_killable(struct page *page)
 {
 	if (PageLocked(page))
-		return wait_on_page_bit_killable(page, PG_locked);
+		return __wait_on_page_locked_killable(page);
 	return 0;
 }
 
@@ -505,7 +507,7 @@ static inline int wait_on_page_locked_killable(struct page *page)
 static inline void wait_on_page_locked(struct page *page)
 {
 	if (PageLocked(page))
-		wait_on_page_bit(page, PG_locked);
+		__wait_on_page_locked(page);
 }
 
 /* 
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 7d50f79..fb83fe0 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -304,8 +304,7 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 		= container_of(wait, struct wait_bit_queue, wait);
 
 	if (wait_bit->key.flags != key->flags ||
-			wait_bit->key.bit_nr != key->bit_nr ||
-			test_bit(key->bit_nr, key->flags))
+			wait_bit->key.bit_nr != key->bit_nr)
 		return 0;
 	else
 		return autoremove_wake_function(wait, mode, sync, key);
diff --git a/mm/filemap.c b/mm/filemap.c
index c60ed0f..93e4385 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -720,10 +720,23 @@ void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
 
 	spin_lock_irqsave(&q->lock, flags);
 	__add_wait_queue(q, waiter);
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
 EXPORT_SYMBOL_GPL(add_page_wait_queue);
 
+/*
+ * If PageWaiters was found to be set at unlock time, __wake_page_waiters
+ * should be called to actually perform the wakeup of waiters.
+ */
+static inline void __wake_page_waiters(struct page *page)
+{
+	ClearPageWaiters(page);
+	smp_mb__after_clear_bit();
+	wake_up_page(page, PG_locked);
+}
+
 /**
  * unlock_page - unlock a locked page
  * @page: the page
@@ -740,8 +753,8 @@ void unlock_page(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_locked);
+	if (unlikely(PageWaiters(page)))
+		__wake_page_waiters(page);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -768,22 +781,87 @@ EXPORT_SYMBOL(end_page_writeback);
  */
 void __lock_page(struct page *page)
 {
+	wait_queue_head_t *wq = page_waitqueue(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	do {
+		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		if (!PageWaiters(page))
+			SetPageWaiters(page);
+		if (likely(PageLocked(page)))
+			sleep_on_page(page);
+	} while (!trylock_page(page));
+	finish_wait(wq, &wait.wait);
 }
 EXPORT_SYMBOL(__lock_page);
 
 int __lock_page_killable(struct page *page)
 {
+	wait_queue_head_t *wq = page_waitqueue(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	int err = 0;
+
+	do {
+		prepare_to_wait(wq, &wait.wait, TASK_KILLABLE);
+		if (!PageWaiters(page))
+			SetPageWaiters(page);
+		if (likely(PageLocked(page))) {
+			err = sleep_on_page_killable(page);
+			if (err)
+				break;
+		}
+	} while (!trylock_page(page));
+	finish_wait(wq, &wait.wait);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	return err;
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int  __wait_on_page_locked_killable(struct page *page)
+{
+	int ret = 0;
+	wait_queue_head_t *wq = page_waitqueue(page);
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	if (!test_bit(PG_locked, &page->flags))
+		return 0;
+	do {
+		prepare_to_wait(wq, &wait.wait, TASK_KILLABLE);
+		if (!PageWaiters(page))
+			SetPageWaiters(page);
+		if (likely(PageLocked(page)))
+			ret = sleep_on_page_killable(page);
+		finish_wait(wq, &wait.wait);
+	} while (PageLocked(page) && !ret);
+
+	/* Clean up a potentially dangling PG_waiters */
+	if (unlikely(PageWaiters(page)))
+		__wake_page_waiters(page);
+
+	return ret;
+}
+EXPORT_SYMBOL(__wait_on_page_locked_killable);
+
+void  __wait_on_page_locked(struct page *page)
+{
+	wait_queue_head_t *wq = page_waitqueue(page);
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	do {
+		prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
+		if (!PageWaiters(page))
+			SetPageWaiters(page);
+		if (likely(PageLocked(page)))
+			sleep_on_page(page);
+	} while (PageLocked(page));
+	finish_wait(wq, &wait.wait);
+
+	/* Clean up a potentially dangling PG_waiters */
+	if (unlikely(PageWaiters(page)))
+		__wake_page_waiters(page);
+}
+EXPORT_SYMBOL(__wait_on_page_locked);
+
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 94c5d06..0e0e9f7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6533,6 +6533,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/17] mm: page_alloc: Do not update zlc unless the zlc is active
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-01 13:25     ` Johannes Weiner
  -1 siblings, 0 replies; 113+ messages in thread
From: Johannes Weiner @ 2014-05-01 13:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:44:32AM +0100, Mel Gorman wrote:
> The zlc is used on NUMA machines to quickly skip over zones that are full.
> However it is always updated, even for the first zone scanned when the
> zlc might not even be active. As it's a write to a bitmap that potentially
> bounces cache line it's deceptively expensive and most machines will not
> care. Only update the zlc if it was active.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 14/17] mm: Do not use atomic operations when releasing pages
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-01 13:29     ` Johannes Weiner
  -1 siblings, 0 replies; 113+ messages in thread
From: Johannes Weiner @ 2014-05-01 13:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:44:45AM +0100, Mel Gorman wrote:
> There should be no references to it any more and a parallel mark should
> not be reordered against us. Use non-locked varient to clear page active.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/swap.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/swap.c b/mm/swap.c
> index f2228b7..7a5bdd7 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, bool cold)
>  		}
>  
>  		/* Clear Active bit in case of parallel mark_page_accessed */
> -		ClearPageActive(page);
> +		__ClearPageActive(page);

Shouldn't this comment be removed also?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15/17] mm: Do not use unnecessary atomic operations when adding pages to the LRU
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-01 13:33     ` Johannes Weiner
  -1 siblings, 0 replies; 113+ messages in thread
From: Johannes Weiner @ 2014-05-01 13:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:44:46AM +0100, Mel Gorman wrote:
> When adding pages to the LRU we clear the active bit unconditionally. As the
> page could be reachable from other paths we cannot use unlocked operations
> without risk of corruption such as a parallel mark_page_accessed. This
> patch test if is necessary to clear the atomic flag before using an atomic
> operation. In the unlikely even this races with mark_page_accesssed the

                             event

> consequences are simply that the page may be promoted to the active list
> that might have been left on the inactive list before the patch. This is
> a marginal consequence.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 113+ messages in thread
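Reduced to a standalone C sketch, the pattern the changelog describes is: do a
cheap test of the flag first and only fall back to the atomic clear when the
bit is actually set. The bit value and helper names below are illustrative
stand-ins for the kernel's PageActive()/ClearPageActive(), not the patch
itself.

#include <stdio.h>

#define PG_ACTIVE (1ul << 6)	/* arbitrary bit chosen for the example */

static int page_active(unsigned long *flags)
{
	/* Plain shared read: no exclusive cache-line ownership needed. */
	return (__atomic_load_n(flags, __ATOMIC_RELAXED) & PG_ACTIVE) != 0;
}

static void clear_page_active(unsigned long *flags)
{
	/* Atomic read-modify-write: the operation worth avoiding. */
	__atomic_fetch_and(flags, ~PG_ACTIVE, __ATOMIC_RELAXED);
}

int main(void)
{
	unsigned long flags = PG_ACTIVE;

	/* Test first so the atomic op only runs when the bit is set. */
	if (page_active(&flags))
		clear_page_active(&flags);

	printf("flags = %#lx\n", flags);
	return 0;
}

The race the changelog mentions is the window around that test: a parallel
mark_page_accessed() setting the bit there only means the page may land on the
active list, which is the marginal consequence described above.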

* Re: [PATCH 14/17] mm: Do not use atomic operations when releasing pages
  2014-05-01 13:29     ` Johannes Weiner
@ 2014-05-01 13:39       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01 13:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, Linux-FSDevel, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:29:22AM -0400, Johannes Weiner wrote:
> On Thu, May 01, 2014 at 09:44:45AM +0100, Mel Gorman wrote:
> > There should be no references to it any more and a parallel mark should
> > not be reordered against us. Use non-locked variant to clear page active.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/swap.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/mm/swap.c b/mm/swap.c
> > index f2228b7..7a5bdd7 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, bool cold)
> >  		}
> >  
> >  		/* Clear Active bit in case of parallel mark_page_accessed */
> > -		ClearPageActive(page);
> > +		__ClearPageActive(page);
> 
> Shouldn't this comment be removed also?

Why? We're still clearing the active bit.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15/17] mm: Do not use unnecessary atomic operations when adding pages to the LRU
  2014-05-01 13:33     ` Johannes Weiner
@ 2014-05-01 13:40       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01 13:40 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, Linux-FSDevel, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:33:40AM -0400, Johannes Weiner wrote:
> On Thu, May 01, 2014 at 09:44:46AM +0100, Mel Gorman wrote:
> > When adding pages to the LRU we clear the active bit unconditionally. As the
> > page could be reachable from other paths we cannot use unlocked operations
> > without risk of corruption such as a parallel mark_page_accessed. This
> > patch test if is necessary to clear the atomic flag before using an atomic
> > operation. In the unlikely even this races with mark_page_accesssed the
> 
>                              event
> 

Will be corrected in v3. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 14/17] mm: Do not use atomic operations when releasing pages
  2014-05-01 13:39       ` Mel Gorman
@ 2014-05-01 13:47         ` Johannes Weiner
  -1 siblings, 0 replies; 113+ messages in thread
From: Johannes Weiner @ 2014-05-01 13:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 02:39:38PM +0100, Mel Gorman wrote:
> On Thu, May 01, 2014 at 09:29:22AM -0400, Johannes Weiner wrote:
> > On Thu, May 01, 2014 at 09:44:45AM +0100, Mel Gorman wrote:
> > > There should be no references to it any more and a parallel mark should
> > > not be reordered against us. Use non-locked variant to clear page active.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  mm/swap.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/mm/swap.c b/mm/swap.c
> > > index f2228b7..7a5bdd7 100644
> > > --- a/mm/swap.c
> > > +++ b/mm/swap.c
> > > @@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, bool cold)
> > >  		}
> > >  
> > >  		/* Clear Active bit in case of parallel mark_page_accessed */
> > > -		ClearPageActive(page);
> > > +		__ClearPageActive(page);
> > 
> > Shouldn't this comment be removed also?
> 
> Why? We're still clearing the active bit.

Ah, I was just confused by the "parallel mark_page_accessed" part.  It
means parallel to release_pages(), but before the put_page_testzero(),
not parallel to the active bit clearing.

^ permalink raw reply	[flat|nested] 113+ messages in thread
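To spell out the distinction behind the question and answer above: the locked
clear is an atomic read-modify-write that tolerates concurrent updates to
other bits in the same flags word, while the double-underscore variant is a
plain load/modify/store that is only safe once the caller holds the last
reference, which release_pages() knows after put_page_testzero(). A standalone
sketch with illustrative stand-ins (not the kernel's
ClearPageActive()/__ClearPageActive()):

#define PG_ACTIVE (1ul << 6)	/* arbitrary example bit */

/* Locked variant: safe even if other CPUs update other bits in the word. */
static void clear_active(unsigned long *flags)
{
	__atomic_fetch_and(flags, ~PG_ACTIVE, __ATOMIC_RELAXED);
}

/*
 * Unlocked variant: plain read-modify-write.  A concurrent setter of any
 * bit in the same word could be lost, so this is only valid while the
 * caller has exclusive access to the flags word.
 */
static void clear_active_nonatomic(unsigned long *flags)
{
	*flags &= ~PG_ACTIVE;
}

int main(void)
{
	unsigned long flags = PG_ACTIVE;

	clear_active_nonatomic(&flags);	/* fine with exclusive ownership */
	clear_active(&flags);		/* always safe, but pays for the atomic op */
	return (int)flags;		/* 0 */
}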

* Re: [PATCH 11/17] mm: page_alloc: Use unsigned int for order in more places
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-01 14:35     ` Dave Hansen
  -1 siblings, 0 replies; 113+ messages in thread
From: Dave Hansen @ 2014-05-01 14:35 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 01:44 AM, Mel Gorman wrote:
> X86 prefers the use of unsigned types for iterators and there is a
> tendency to mix whether a signed or unsigned type is used for page
> order. This converts a number of sites in mm/page_alloc.c to use
> unsigned int for order where possible.

Does this actually generate any different code?  I'd actually expect
something like 'order' to be one of the easiest things for the compiler
to figure out an absolute range on.


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 11/17] mm: page_alloc: Use unsigned int for order in more places
  2014-05-01 14:35     ` Dave Hansen
@ 2014-05-01 15:11       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-01 15:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 07:35:47AM -0700, Dave Hansen wrote:
> On 05/01/2014 01:44 AM, Mel Gorman wrote:
> > X86 prefers the use of unsigned types for iterators and there is a
> > tendency to mix whether a signed or unsigned type is used for page
> > order. This converts a number of sites in mm/page_alloc.c to use
> > unsigned int for order where possible.
> 
> Does this actually generate any different code?  I'd actually expect
> something like 'order' to be one of the easiest things for the compiler
> to figure out an absolute range on.
> 

Yeah, it generates different code. Considering that this patch affects an
API that can be called external to the code block how would the compiler
know what the range of order would be in all cases?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 11/17] mm: page_alloc: Use unsigned int for order in more places
  2014-05-01 15:11       ` Mel Gorman
@ 2014-05-01 15:38         ` Dave Hansen
  -1 siblings, 0 replies; 113+ messages in thread
From: Dave Hansen @ 2014-05-01 15:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On 05/01/2014 08:11 AM, Mel Gorman wrote:
> On Thu, May 01, 2014 at 07:35:47AM -0700, Dave Hansen wrote:
>> On 05/01/2014 01:44 AM, Mel Gorman wrote:
>>> X86 prefers the use of unsigned types for iterators and there is a
>>> tendency to mix whether a signed or unsigned type is used for page
>>> order. This converts a number of sites in mm/page_alloc.c to use
>>> unsigned int for order where possible.
>>
>> Does this actually generate any different code?  I'd actually expect
>> something like 'order' to be one of the easiest things for the compiler
>> to figure out an absolute range on.
> 
> Yeah, it generates different code. Considering that this patch affects an
> API that can be called external to the code block how would the compiler
> know what the range of order would be in all cases?

The compiler comprehends that if you do a check against a constant like
MAX_ORDER early in the function then the variable has a limited
range, like the check we do first-thing in __alloc_pages_slowpath().

The more I think about it, at least in page_alloc.c, I don't see any
checks for order<0, which means the compiler isn't free to do this
anyway.  Your move over to an unsigned type gives that check for free
essentially.

So this makes a lot of sense in any case.  I was just curious if it
affected the code.

^ permalink raw reply	[flat|nested] 113+ messages in thread
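A small standalone illustration of the range-check point; MAX_ORDER below is
just a representative constant and neither helper is kernel code:

#include <stdio.h>

#define MAX_ORDER 11

/* With a signed order a full sanity check needs two comparisons. */
static int order_ok_signed(int order)
{
	return order >= 0 && order < MAX_ORDER;
}

/*
 * With an unsigned order, "negative" values wrap to something far above
 * MAX_ORDER, so a single comparison covers both ends of the range.
 */
static int order_ok_unsigned(unsigned int order)
{
	return order < MAX_ORDER;
}

int main(void)
{
	/* Both print 0: -1 fails the signed check and, converted to
	 * (unsigned)-1, also fails the single unsigned bound. */
	printf("%d %d\n", order_ok_signed(-1), order_ok_unsigned(-1));
	return 0;
}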

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-02 22:34     ` Sasha Levin
  -1 siblings, 0 replies; 113+ messages in thread
From: Sasha Levin @ 2014-05-02 22:34 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

Hi Mel,

Vlastimil Babka suggested I should try this patch to work around a different
issue I'm seeing, and noticed that it doesn't build because:

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> +void set_pageblock_flags_mask(struct page *page,
> +				unsigned long flags,
> +				unsigned long end_bitidx,
> +				unsigned long nr_flag_bits,
> +				unsigned long mask);

set_pageblock_flags_mask() is declared.


> +static inline void set_pageblock_flags_group(struct page *page,
> +					unsigned long flags,
> +					int start_bitidx, int end_bitidx)
> +{
> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> +	unsigned long mask = (1 << nr_flag_bits) - 1;
> +
> +	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
> +}

And used here, but never actually defined.


Thanks,
Sasha

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-02 22:34     ` Sasha Levin
@ 2014-05-04 13:14       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-04 13:14 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Fri, May 02, 2014 at 06:34:52PM -0400, Sasha Levin wrote:
> Hi Mel,
> 
> Vlastimil Babka suggested I should try this patch to work around a different
> issue I'm seeing, and noticed that it doesn't build because:
> 

Rebasing SNAFU. Can you try this instead?

---8<---
mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps

The test_bit operations in get/set pageblock flags are expensive. This patch
reads the bitmap on a word basis and uses shifts and masks to isolate the bits
of interest. Similarly, masks are used to set a local copy of the bitmap and
cmpxchg is then used to update the bitmap if there have been no other changes made in
parallel.

In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.

Signed-off-by: Mel Gorman <mgorman@suse.de>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fac5509..c84703d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -75,9 +75,14 @@ enum {
 
 extern int page_group_by_mobility_disabled;
 
+#define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
+#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
+
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
+	return get_pageblock_flags_mask(page, PB_migrate_end,
+					NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
 }
 
 struct free_area {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 2ee8cd2..bc37036 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,9 +30,12 @@ enum pageblock_bits {
 	PB_migrate,
 	PB_migrate_end = PB_migrate + 3 - 1,
 			/* 3 bits required for migrate types */
-#ifdef CONFIG_COMPACTION
 	PB_migrate_skip,/* If set the block is skipped by compaction */
-#endif /* CONFIG_COMPACTION */
+
+	/*
+	 * Assume the bits will always align on a word. If this assumption
+	 * changes then get/set pageblock needs updating.
+	 */
 	NR_PAGEBLOCK_BITS
 };
 
@@ -62,11 +65,35 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
+unsigned long get_pageblock_flags_mask(struct page *page,
+				unsigned long end_bitidx,
+				unsigned long nr_flag_bits,
+				unsigned long mask);
+void set_pageblock_flags_mask(struct page *page,
+				unsigned long flags,
+				unsigned long end_bitidx,
+				unsigned long nr_flag_bits,
+				unsigned long mask);
+
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx);
-void set_pageblock_flags_group(struct page *page, unsigned long flags,
-					int start_bitidx, int end_bitidx);
+static inline unsigned long get_pageblock_flags_group(struct page *page,
+					int start_bitidx, int end_bitidx)
+{
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+
+	return get_pageblock_flags_mask(page, end_bitidx, nr_flag_bits, mask);
+}
+
+static inline void set_pageblock_flags_group(struct page *page,
+					unsigned long flags,
+					int start_bitidx, int end_bitidx)
+{
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+
+	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
+}
 
 #ifdef CONFIG_COMPACTION
 #define get_pageblock_skip(page) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc123ff..f393b0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6032,53 +6032,64 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
+unsigned long get_pageblock_flags_mask(struct page *page,
+					unsigned long end_bitidx,
+					unsigned long nr_flag_bits,
+					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long flags = 0;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long word;
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (test_bit(bitidx + start_bitidx, bitmap))
-			flags |= value;
-
-	return flags;
+	word = bitmap[word_bitidx];
+	bitidx += end_bitidx;
+	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
 }
 
 /**
- * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages
+ * set_pageblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
  * @page: The page within the block of interest
  * @start_bitidx: The first bit of interest
  * @end_bitidx: The last bit of interest
  * @flags: The flags to set
  */
-void set_pageblock_flags_group(struct page *page, unsigned long flags,
-					int start_bitidx, int end_bitidx)
+void set_pageblock_flags_mask(struct page *page, unsigned long flags,
+					unsigned long end_bitidx,
+					unsigned long nr_flag_bits,
+					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long old_word, new_word;
+
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
+
 	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (flags & value)
-			__set_bit(bitidx + start_bitidx, bitmap);
-		else
-			__clear_bit(bitidx + start_bitidx, bitmap);
+	bitidx += end_bitidx;
+	mask <<= (BITS_PER_LONG - bitidx - 1);
+	flags <<= (BITS_PER_LONG - bitidx - 1);
+
+	do {
+		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
+		new_word = (old_word & ~mask) | flags;
+	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 113+ messages in thread
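The core technique in the patch, reduced to a standalone C demonstration: a
group of flag bits lives in one unsigned long, a read is a single load plus
shift and mask, and a write is a cmpxchg loop on a locally modified copy. The
helper names and the bit layout (the group starts at a caller-supplied shift
from the least significant bit) are simplifications for the example; the
kernel version derives the shift from the pageblock's bit index within the
bitmap word.

#include <stdio.h>

#define NR_FLAG_BITS	3
#define FLAG_MASK	((1ul << NR_FLAG_BITS) - 1)

static unsigned long get_flags(unsigned long *word, unsigned int shift)
{
	/* One load, then isolate the bits of interest. */
	return (__atomic_load_n(word, __ATOMIC_RELAXED) >> shift) & FLAG_MASK;
}

static void set_flags(unsigned long *word, unsigned int shift, unsigned long flags)
{
	unsigned long mask = FLAG_MASK << shift;
	unsigned long old_word, new_word;

	flags = (flags & FLAG_MASK) << shift;
	do {
		old_word = __atomic_load_n(word, __ATOMIC_RELAXED);
		new_word = (old_word & ~mask) | flags;
		/* Retry only if another writer changed the word meanwhile. */
	} while (!__atomic_compare_exchange_n(word, &old_word, new_word, 0,
					      __ATOMIC_RELAXED, __ATOMIC_RELAXED));
}

int main(void)
{
	unsigned long bitmap = 0;

	set_flags(&bitmap, 4, 5);	/* store a migratetype-like value */
	printf("flags = %lu\n", get_flags(&bitmap, 4));
	return 0;
}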

* Re: [PATCH 17/17] mm: filemap: Avoid unnecessary barries and waitqueue lookup in unlock_page fastpath
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-05 10:50     ` Jan Kara
  -1 siblings, 0 replies; 113+ messages in thread
From: Jan Kara @ 2014-05-05 10:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Thu 01-05-14 09:44:48, Mel Gorman wrote:
> From: Nick Piggin <npiggin@suse.de>
> 
> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> 
> This adds a few branches to the fast path but avoids bouncing a dirty
> cache line between CPUs. 32-bit machines always take the slow path but the
> primary motivation for this patch is large machines so I do not think that
> is a concern.
...
>  /* 
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 7d50f79..fb83fe0 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -304,8 +304,7 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
>  		= container_of(wait, struct wait_bit_queue, wait);
>  
>  	if (wait_bit->key.flags != key->flags ||
> -			wait_bit->key.bit_nr != key->bit_nr ||
> -			test_bit(key->bit_nr, key->flags))
> +			wait_bit->key.bit_nr != key->bit_nr)
>  		return 0;
>  	else
>  		return autoremove_wake_function(wait, mode, sync, key);
  This change seems to be really unrelated? And it would deserve a comment
on its own, I'd think, so maybe split that into a separate patch?

> diff --git a/mm/filemap.c b/mm/filemap.c
> index c60ed0f..93e4385 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> +int  __wait_on_page_locked_killable(struct page *page)
> +{
> +	int ret = 0;
> +	wait_queue_head_t *wq = page_waitqueue(page);
> +	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> +
> +	if (!test_bit(PG_locked, &page->flags))
> +		return 0;
> +	do {
> +		prepare_to_wait(wq, &wait.wait, TASK_KILLABLE);
> +		if (!PageWaiters(page))
> +			SetPageWaiters(page);
> +		if (likely(PageLocked(page)))
> +			ret = sleep_on_page_killable(page);
> +		finish_wait(wq, &wait.wait);
> +	} while (PageLocked(page) && !ret);
  So I'm somewhat wondering why this is the only page waiting variant that
does finish_wait() inside the loop. Everyone else does it outside the while
loop which seems sufficient to me even in this case...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 113+ messages in thread
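For readers following the review, the fastpath under discussion boils down to
keeping a "waiters" bit next to the lock bit so the unlock side can skip the
hashed waitqueue lookup when nobody is waiting. A userspace sketch of that
shape follows; the bit names are stand-ins, the printf stands in for the
page_waitqueue() lookup and wake-up, and details such as when the waiters bit
is cleared are deliberately omitted.

#include <stdio.h>

#define LOCKED	(1ul << 0)
#define WAITERS	(1ul << 1)

static unsigned long flags;

static void unlock(void)
{
	/* Release the lock bit and look at the flags seen while doing so. */
	unsigned long old = __atomic_fetch_and(&flags, ~LOCKED, __ATOMIC_RELEASE);

	if (old & WAITERS) {
		/* Slow path stand-in: the kernel would wake the hashed
		 * waitqueue for the page here. */
		printf("waking waiters\n");
	}
}

int main(void)
{
	flags = LOCKED;
	unlock();			/* fast path: no waiters, no lookup */
	flags = LOCKED | WAITERS;
	unlock();			/* waiters present: slow path taken */
	return 0;
}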

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-04 13:14       ` Mel Gorman
@ 2014-05-05 12:40         ` Vlastimil Babka
  -1 siblings, 0 replies; 113+ messages in thread
From: Vlastimil Babka @ 2014-05-05 12:40 UTC (permalink / raw)
  To: Mel Gorman, Sasha Levin
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/04/2014 03:14 PM, Mel Gorman wrote:
> On Fri, May 02, 2014 at 06:34:52PM -0400, Sasha Levin wrote:
>> Hi Mel,
>>
>> Vlastimil Babka suggested I should try this patch to work around a different
>> issue I'm seeing, and noticed that it doesn't build because:
>>
>
> Rebasing SNAFU. Can you try this instead?
>
> ---8<---
> mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
>
> The test_bit operations in get/set pageblock flags are expensive. This patch
> reads the bitmap on a word basis and uses shifts and masks to isolate the bits
> of interest. Similarly, masks are used to set a local copy of the bitmap and
> cmpxchg is then used to update the bitmap if there have been no other changes made in
> parallel.
>
> In a test running dd onto tmpfs the overhead of the pageblock-related
> functions went from 1.27% in profiles to 0.5%.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fac5509..c84703d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -75,9 +75,14 @@ enum {
>
>   extern int page_group_by_mobility_disabled;
>
> +#define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
> +#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
> +
>   static inline int get_pageblock_migratetype(struct page *page)
>   {
> -	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
> +	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
> +	return get_pageblock_flags_mask(page, PB_migrate_end,
> +					NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
>   }
>
>   struct free_area {
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 2ee8cd2..bc37036 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -30,9 +30,12 @@ enum pageblock_bits {
>   	PB_migrate,
>   	PB_migrate_end = PB_migrate + 3 - 1,
>   			/* 3 bits required for migrate types */
> -#ifdef CONFIG_COMPACTION
>   	PB_migrate_skip,/* If set the block is skipped by compaction */
> -#endif /* CONFIG_COMPACTION */
> +
> +	/*
> +	 * Assume the bits will always align on a word. If this assumption
> +	 * changes then get/set pageblock needs updating.
> +	 */
>   	NR_PAGEBLOCK_BITS
>   };
>
> @@ -62,11 +65,35 @@ extern int pageblock_order;
>   /* Forward declaration */
>   struct page;
>
> +unsigned long get_pageblock_flags_mask(struct page *page,
> +				unsigned long end_bitidx,
> +				unsigned long nr_flag_bits,
> +				unsigned long mask);
> +void set_pageblock_flags_mask(struct page *page,
> +				unsigned long flags,
> +				unsigned long end_bitidx,
> +				unsigned long nr_flag_bits,
> +				unsigned long mask);
> +

The nr_flag_bits parameter is not used anymore and can be dropped.

>   /* Declarations for getting and setting flags. See mm/page_alloc.c */
> -unsigned long get_pageblock_flags_group(struct page *page,
> -					int start_bitidx, int end_bitidx);
> -void set_pageblock_flags_group(struct page *page, unsigned long flags,
> -					int start_bitidx, int end_bitidx);
> +static inline unsigned long get_pageblock_flags_group(struct page *page,
> +					int start_bitidx, int end_bitidx)
> +{
> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> +	unsigned long mask = (1 << nr_flag_bits) - 1;
> +
> +	return get_pageblock_flags_mask(page, end_bitidx, nr_flag_bits, mask);
> +}
> +
> +static inline void set_pageblock_flags_group(struct page *page,
> +					unsigned long flags,
> +					int start_bitidx, int end_bitidx)
> +{
> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> +	unsigned long mask = (1 << nr_flag_bits) - 1;
> +
> +	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
> +}
>
>   #ifdef CONFIG_COMPACTION
>   #define get_pageblock_skip(page) \
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dc123ff..f393b0e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6032,53 +6032,64 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
>    * @end_bitidx: The last bit of interest
>    * returns pageblock_bits flags
>    */
> -unsigned long get_pageblock_flags_group(struct page *page,
> -					int start_bitidx, int end_bitidx)
> +unsigned long get_pageblock_flags_mask(struct page *page,
> +					unsigned long end_bitidx,
> +					unsigned long nr_flag_bits,
> +					unsigned long mask)
>   {
>   	struct zone *zone;
>   	unsigned long *bitmap;
> -	unsigned long pfn, bitidx;
> -	unsigned long flags = 0;
> -	unsigned long value = 1;
> +	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long word;
>
>   	zone = page_zone(page);
>   	pfn = page_to_pfn(page);
>   	bitmap = get_pageblock_bitmap(zone, pfn);
>   	bitidx = pfn_to_bitidx(zone, pfn);
> +	word_bitidx = bitidx / BITS_PER_LONG;
> +	bitidx &= (BITS_PER_LONG-1);
>
> -	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> -		if (test_bit(bitidx + start_bitidx, bitmap))
> -			flags |= value;
> -
> -	return flags;
> +	word = bitmap[word_bitidx];

I wonder if on some architecture this may result in an inconsistent word
when racing with set(), i.e. cmpxchg? We need consistency at least at byte
granularity to prevent the problem of bogus migratetype values being read.

> +	bitidx += end_bitidx;
> +	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;

Yes that looks correct to me, bits don't seem to overlap anymore.

>   }
>
>   /**
> - * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages
> + * set_pageblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
>    * @page: The page within the block of interest
>    * @start_bitidx: The first bit of interest
>    * @end_bitidx: The last bit of interest
>    * @flags: The flags to set
>    */
> -void set_pageblock_flags_group(struct page *page, unsigned long flags,
> -					int start_bitidx, int end_bitidx)
> +void set_pageblock_flags_mask(struct page *page, unsigned long flags,
> +					unsigned long end_bitidx,
> +					unsigned long nr_flag_bits,
> +					unsigned long mask)
>   {
>   	struct zone *zone;
>   	unsigned long *bitmap;
> -	unsigned long pfn, bitidx;
> -	unsigned long value = 1;
> +	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long old_word, new_word;
> +
> +	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
>
>   	zone = page_zone(page);
>   	pfn = page_to_pfn(page);
>   	bitmap = get_pageblock_bitmap(zone, pfn);
>   	bitidx = pfn_to_bitidx(zone, pfn);
> +	word_bitidx = bitidx / BITS_PER_LONG;
> +	bitidx &= (BITS_PER_LONG-1);
> +
>   	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
>
> -	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> -		if (flags & value)
> -			__set_bit(bitidx + start_bitidx, bitmap);
> -		else
> -			__clear_bit(bitidx + start_bitidx, bitmap);
> +	bitidx += end_bitidx;
> +	mask <<= (BITS_PER_LONG - bitidx - 1);
> +	flags <<= (BITS_PER_LONG - bitidx - 1);
> +
> +	do {
> +		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
> +		new_word = (old_word & ~mask) | flags;
> +	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);

The bitfield logic here seems fine as well.

>   }
>
>   /*
>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
@ 2014-05-05 12:40         ` Vlastimil Babka
  0 siblings, 0 replies; 113+ messages in thread
From: Vlastimil Babka @ 2014-05-05 12:40 UTC (permalink / raw)
  To: Mel Gorman, Sasha Levin
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/04/2014 03:14 PM, Mel Gorman wrote:
> On Fri, May 02, 2014 at 06:34:52PM -0400, Sasha Levin wrote:
>> Hi Mel,
>>
>> Vlastimil Babka suggested I should try this patch to work around a different
>> issue I'm seeing, and noticed that it doesn't build because:
>>
>
> Rebasing SNAFU. Can you try this instead?
>
> ---8<---
> mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
>
> The test_bit operations in get/set pageblock flags are expensive. This patch
> reads the bitmap on a word basis and use shifts and masks to isolate the bits
> of interest. Similarly masks are used to set a local copy of the bitmap and then
> use cmpxchg to update the bitmap if there have been no other changes made in
> parallel.
>
> In a test running dd onto tmpfs the overhead of the pageblock-related
> functions went from 1.27% in profiles to 0.5%.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fac5509..c84703d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -75,9 +75,14 @@ enum {
>
>   extern int page_group_by_mobility_disabled;
>
> +#define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
> +#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
> +
>   static inline int get_pageblock_migratetype(struct page *page)
>   {
> -	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
> +	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
> +	return get_pageblock_flags_mask(page, PB_migrate_end,
> +					NR_MIGRATETYPE_BITS, MIGRATETYPE_MASK);
>   }
>
>   struct free_area {
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index 2ee8cd2..bc37036 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -30,9 +30,12 @@ enum pageblock_bits {
>   	PB_migrate,
>   	PB_migrate_end = PB_migrate + 3 - 1,
>   			/* 3 bits required for migrate types */
> -#ifdef CONFIG_COMPACTION
>   	PB_migrate_skip,/* If set the block is skipped by compaction */
> -#endif /* CONFIG_COMPACTION */
> +
> +	/*
> +	 * Assume the bits will always align on a word. If this assumption
> +	 * changes then get/set pageblock needs updating.
> +	 */
>   	NR_PAGEBLOCK_BITS
>   };
>
> @@ -62,11 +65,35 @@ extern int pageblock_order;
>   /* Forward declaration */
>   struct page;
>
> +unsigned long get_pageblock_flags_mask(struct page *page,
> +				unsigned long end_bitidx,
> +				unsigned long nr_flag_bits,
> +				unsigned long mask);
> +void set_pageblock_flags_mask(struct page *page,
> +				unsigned long flags,
> +				unsigned long end_bitidx,
> +				unsigned long nr_flag_bits,
> +				unsigned long mask);
> +

The nr_flag_bits parameter is not used anymore and can be dropped.

>   /* Declarations for getting and setting flags. See mm/page_alloc.c */
> -unsigned long get_pageblock_flags_group(struct page *page,
> -					int start_bitidx, int end_bitidx);
> -void set_pageblock_flags_group(struct page *page, unsigned long flags,
> -					int start_bitidx, int end_bitidx);
> +static inline unsigned long get_pageblock_flags_group(struct page *page,
> +					int start_bitidx, int end_bitidx)
> +{
> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> +	unsigned long mask = (1 << nr_flag_bits) - 1;
> +
> +	return get_pageblock_flags_mask(page, end_bitidx, nr_flag_bits, mask);
> +}
> +
> +static inline void set_pageblock_flags_group(struct page *page,
> +					unsigned long flags,
> +					int start_bitidx, int end_bitidx)
> +{
> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> +	unsigned long mask = (1 << nr_flag_bits) - 1;
> +
> +	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
> +}
>
>   #ifdef CONFIG_COMPACTION
>   #define get_pageblock_skip(page) \
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dc123ff..f393b0e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6032,53 +6032,64 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
>    * @end_bitidx: The last bit of interest
>    * returns pageblock_bits flags
>    */
> -unsigned long get_pageblock_flags_group(struct page *page,
> -					int start_bitidx, int end_bitidx)
> +unsigned long get_pageblock_flags_mask(struct page *page,
> +					unsigned long end_bitidx,
> +					unsigned long nr_flag_bits,
> +					unsigned long mask)
>   {
>   	struct zone *zone;
>   	unsigned long *bitmap;
> -	unsigned long pfn, bitidx;
> -	unsigned long flags = 0;
> -	unsigned long value = 1;
> +	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long word;
>
>   	zone = page_zone(page);
>   	pfn = page_to_pfn(page);
>   	bitmap = get_pageblock_bitmap(zone, pfn);
>   	bitidx = pfn_to_bitidx(zone, pfn);
> +	word_bitidx = bitidx / BITS_PER_LONG;
> +	bitidx &= (BITS_PER_LONG-1);
>
> -	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> -		if (test_bit(bitidx + start_bitidx, bitmap))
> -			flags |= value;
> -
> -	return flags;
> +	word = bitmap[word_bitidx];

I wonder if on some architecture this may result in inconsistent word 
when racing with set(), i.e. cmpxchg? We need consistency at least on 
the granularity of byte to prevent the problem with bogus migratetype 
values being read.

> +	bitidx += end_bitidx;
> +	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;

Yes that looks correct to me, bits don't seem to overlap anymore.

>   }
>
>   /**
> - * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages
> + * set_pageblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
>    * @page: The page within the block of interest
>    * @start_bitidx: The first bit of interest
>    * @end_bitidx: The last bit of interest
>    * @flags: The flags to set
>    */
> -void set_pageblock_flags_group(struct page *page, unsigned long flags,
> -					int start_bitidx, int end_bitidx)
> +void set_pageblock_flags_mask(struct page *page, unsigned long flags,
> +					unsigned long end_bitidx,
> +					unsigned long nr_flag_bits,
> +					unsigned long mask)
>   {
>   	struct zone *zone;
>   	unsigned long *bitmap;
> -	unsigned long pfn, bitidx;
> -	unsigned long value = 1;
> +	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long old_word, new_word;
> +
> +	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
>
>   	zone = page_zone(page);
>   	pfn = page_to_pfn(page);
>   	bitmap = get_pageblock_bitmap(zone, pfn);
>   	bitidx = pfn_to_bitidx(zone, pfn);
> +	word_bitidx = bitidx / BITS_PER_LONG;
> +	bitidx &= (BITS_PER_LONG-1);
> +
>   	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
>
> -	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> -		if (flags & value)
> -			__set_bit(bitidx + start_bitidx, bitmap);
> -		else
> -			__clear_bit(bitidx + start_bitidx, bitmap);
> +	bitidx += end_bitidx;
> +	mask <<= (BITS_PER_LONG - bitidx - 1);
> +	flags <<= (BITS_PER_LONG - bitidx - 1);
> +
> +	do {
> +		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
> +		new_word = (old_word & ~mask) | flags;
> +	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);

The bitfield logic here seems fine as well.

>   }
>
>   /*
>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-05 12:40         ` Vlastimil Babka
@ 2014-05-06  9:13           ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-06  9:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sasha Levin, Linux-MM, Linux-FSDevel, Johannes Weiner, Jan Kara,
	Michal Hocko, Hugh Dickins, Linux Kernel

On Mon, May 05, 2014 at 02:40:38PM +0200, Vlastimil Babka wrote:
> >@@ -62,11 +65,35 @@ extern int pageblock_order;
> >  /* Forward declaration */
> >  struct page;
> >
> >+unsigned long get_pageblock_flags_mask(struct page *page,
> >+				unsigned long end_bitidx,
> >+				unsigned long nr_flag_bits,
> >+				unsigned long mask);
> >+void set_pageblock_flags_mask(struct page *page,
> >+				unsigned long flags,
> >+				unsigned long end_bitidx,
> >+				unsigned long nr_flag_bits,
> >+				unsigned long mask);
> >+
> 
> The nr_flag_bits parameter is not used anymore and can be dropped.
> 

Fixed

> >  /* Declarations for getting and setting flags. See mm/page_alloc.c */
> >-unsigned long get_pageblock_flags_group(struct page *page,
> >-					int start_bitidx, int end_bitidx);
> >-void set_pageblock_flags_group(struct page *page, unsigned long flags,
> >-					int start_bitidx, int end_bitidx);
> >+static inline unsigned long get_pageblock_flags_group(struct page *page,
> >+					int start_bitidx, int end_bitidx)
> >+{
> >+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> >+	unsigned long mask = (1 << nr_flag_bits) - 1;
> >+
> >+	return get_pageblock_flags_mask(page, end_bitidx, nr_flag_bits, mask);
> >+}
> >+
> >+static inline void set_pageblock_flags_group(struct page *page,
> >+					unsigned long flags,
> >+					int start_bitidx, int end_bitidx)
> >+{
> >+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> >+	unsigned long mask = (1 << nr_flag_bits) - 1;
> >+
> >+	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
> >+}
> >
> >  #ifdef CONFIG_COMPACTION
> >  #define get_pageblock_skip(page) \
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index dc123ff..f393b0e 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -6032,53 +6032,64 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
> >   * @end_bitidx: The last bit of interest
> >   * returns pageblock_bits flags
> >   */
> >-unsigned long get_pageblock_flags_group(struct page *page,
> >-					int start_bitidx, int end_bitidx)
> >+unsigned long get_pageblock_flags_mask(struct page *page,
> >+					unsigned long end_bitidx,
> >+					unsigned long nr_flag_bits,
> >+					unsigned long mask)
> >  {
> >  	struct zone *zone;
> >  	unsigned long *bitmap;
> >-	unsigned long pfn, bitidx;
> >-	unsigned long flags = 0;
> >-	unsigned long value = 1;
> >+	unsigned long pfn, bitidx, word_bitidx;
> >+	unsigned long word;
> >
> >  	zone = page_zone(page);
> >  	pfn = page_to_pfn(page);
> >  	bitmap = get_pageblock_bitmap(zone, pfn);
> >  	bitidx = pfn_to_bitidx(zone, pfn);
> >+	word_bitidx = bitidx / BITS_PER_LONG;
> >+	bitidx &= (BITS_PER_LONG-1);
> >
> >-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> >-		if (test_bit(bitidx + start_bitidx, bitmap))
> >-			flags |= value;
> >-
> >-	return flags;
> >+	word = bitmap[word_bitidx];
> 
> I wonder if on some architecture this may result in inconsistent
> word when racing with set(), i.e. cmpxchg? We need consistency at
> least on the granularity of byte to prevent the problem with bogus
> migratetype values being read.
> 

The number of bits align on the byte boundary so I do not think there is
a problem there. There is a BUILD_BUG_ON check in set_pageblock_flags_mask
in case this changes so it can be revisited if necessary.
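
As an aside for anyone following the arithmetic, the quoted get/set paths
can be modelled in userspace. The sketch below assumes 64-bit longs and the
4 bits per pageblock checked by the BUILD_BUG_ON above; the helper names,
the test values and the plain (non-cmpxchg) store are illustrative only and
not part of the patch, with the PB_* values as in pageblock-flags.h.

#include <assert.h>
#include <stdio.h>

#define BITS_PER_LONG   64
#define PB_migrate_end  2       /* last migratetype bit */
#define PB_migrate_skip 3       /* compaction skip bit */

static unsigned long word;      /* stands in for bitmap[word_bitidx] */

static void set_flags(unsigned long bitidx, unsigned long end_bitidx,
                      unsigned long flags, unsigned long mask)
{
        bitidx += end_bitidx;
        mask <<= (BITS_PER_LONG - bitidx - 1);
        flags <<= (BITS_PER_LONG - bitidx - 1);
        word = (word & ~mask) | flags;  /* the real code uses cmpxchg() */
}

static unsigned long get_flags(unsigned long bitidx, unsigned long end_bitidx,
                               unsigned long mask)
{
        bitidx += end_bitidx;
        return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
}

int main(void)
{
        /* pageblock 0 starts at bit index 0, pageblock 1 at bit index 4 */
        set_flags(0, PB_migrate_end, 5, 0x7);   /* migratetype of block 0 */
        set_flags(4, PB_migrate_end, 2, 0x7);   /* migratetype of block 1 */
        set_flags(0, PB_migrate_skip, 1, 0x1);  /* skip bit of block 0 */

        assert(get_flags(0, PB_migrate_end, 0x7) == 5);
        assert(get_flags(4, PB_migrate_end, 0x7) == 2);
        assert(get_flags(0, PB_migrate_skip, 0x1) == 1);
        printf("packed word: %#lx\n", word);    /* no group overlaps another */
        return 0;
}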

> >+	bitidx += end_bitidx;
> >+	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
> 
> Yes that looks correct to me, bits don't seem to overlap anymore.
> 

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-06  9:13           ` Mel Gorman
@ 2014-05-06 14:42             ` Vlastimil Babka
  -1 siblings, 0 replies; 113+ messages in thread
From: Vlastimil Babka @ 2014-05-06 14:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Sasha Levin, Linux-MM, Linux-FSDevel, Johannes Weiner, Jan Kara,
	Michal Hocko, Hugh Dickins, Linux Kernel

On 05/06/2014 11:13 AM, Mel Gorman wrote:
> On Mon, May 05, 2014 at 02:40:38PM +0200, Vlastimil Babka wrote:
>>> @@ -62,11 +65,35 @@ extern int pageblock_order;
>>>   /* Forward declaration */
>>>   struct page;
>>>
>>> +unsigned long get_pageblock_flags_mask(struct page *page,
>>> +				unsigned long end_bitidx,
>>> +				unsigned long nr_flag_bits,
>>> +				unsigned long mask);
>>> +void set_pageblock_flags_mask(struct page *page,
>>> +				unsigned long flags,
>>> +				unsigned long end_bitidx,
>>> +				unsigned long nr_flag_bits,
>>> +				unsigned long mask);
>>> +
>>
>> The nr_flag_bits parameter is not used anymore and can be dropped.
>>
>
> Fixed
>
>>>   /* Declarations for getting and setting flags. See mm/page_alloc.c */
>>> -unsigned long get_pageblock_flags_group(struct page *page,
>>> -					int start_bitidx, int end_bitidx);
>>> -void set_pageblock_flags_group(struct page *page, unsigned long flags,
>>> -					int start_bitidx, int end_bitidx);
>>> +static inline unsigned long get_pageblock_flags_group(struct page *page,
>>> +					int start_bitidx, int end_bitidx)
>>> +{
>>> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
>>> +	unsigned long mask = (1 << nr_flag_bits) - 1;
>>> +
>>> +	return get_pageblock_flags_mask(page, end_bitidx, nr_flag_bits, mask);
>>> +}
>>> +
>>> +static inline void set_pageblock_flags_group(struct page *page,
>>> +					unsigned long flags,
>>> +					int start_bitidx, int end_bitidx)
>>> +{
>>> +	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
>>> +	unsigned long mask = (1 << nr_flag_bits) - 1;
>>> +
>>> +	set_pageblock_flags_mask(page, flags, end_bitidx, nr_flag_bits, mask);
>>> +}
>>>
>>>   #ifdef CONFIG_COMPACTION
>>>   #define get_pageblock_skip(page) \
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index dc123ff..f393b0e 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6032,53 +6032,64 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
>>>    * @end_bitidx: The last bit of interest
>>>    * returns pageblock_bits flags
>>>    */
>>> -unsigned long get_pageblock_flags_group(struct page *page,
>>> -					int start_bitidx, int end_bitidx)
>>> +unsigned long get_pageblock_flags_mask(struct page *page,
>>> +					unsigned long end_bitidx,
>>> +					unsigned long nr_flag_bits,
>>> +					unsigned long mask)
>>>   {
>>>   	struct zone *zone;
>>>   	unsigned long *bitmap;
>>> -	unsigned long pfn, bitidx;
>>> -	unsigned long flags = 0;
>>> -	unsigned long value = 1;
>>> +	unsigned long pfn, bitidx, word_bitidx;
>>> +	unsigned long word;
>>>
>>>   	zone = page_zone(page);
>>>   	pfn = page_to_pfn(page);
>>>   	bitmap = get_pageblock_bitmap(zone, pfn);
>>>   	bitidx = pfn_to_bitidx(zone, pfn);
>>> +	word_bitidx = bitidx / BITS_PER_LONG;
>>> +	bitidx &= (BITS_PER_LONG-1);
>>>
>>> -	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
>>> -		if (test_bit(bitidx + start_bitidx, bitmap))
>>> -			flags |= value;
>>> -
>>> -	return flags;
>>> +	word = bitmap[word_bitidx];
>>
>> I wonder if on some architecture this may result in inconsistent
>> word when racing with set(), i.e. cmpxchg? We need consistency at
>> least on the granularity of byte to prevent the problem with bogus
>> migratetype values being read.
>>
>
> The number of bits align on the byte boundary so I do not think there is
> a problem there. There is a BUILD_BUG_ON check in set_pageblock_flags_mask
> in case this changes so it can be revisited if necessary.

I was wondering about hardware guarantees in that case (e.g. consistency 
at least on the granularity of byte when a simple memory read races with 
write) but after some discussion in the office I understand that 
hardware without such guarantees wouldn't be able to run Linux anyway :)

Still I wonder if ACCESS_ONCE would be safer in the 'word' variable 
assignment to protect against compiler trying to be too smart?

Anyway with the nr_flag_bits removed:

Acked-by: Vlastimil Babka <vbabka@suse.cz>

>>> +	bitidx += end_bitidx;
>>> +	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
>>
>> Yes that looks correct to me, bits don't seem to overlap anymore.
>>
>
> Thanks.
>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 01/17] mm: page_alloc: Do not update zlc unless the zlc is active
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 15:04     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 15:04 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> The zlc is used on NUMA machines to quickly skip over zones that are full.
> However it is always updated, even for the first zone scanned when the
> zlc might not even be active. As it's a write to a bitmap that potentially
> bounces cache line it's deceptively expensive and most machines will not
> care. Only update the zlc if it was active.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 02/17] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full"
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 15:09     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 15:09 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> If a zone cannot be used for a dirty page then it gets marked "full"
> which is cached in the zlc and later potentially skipped by allocation
> requests that have nothing to do with dirty zones.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 03/17] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 15:10     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 15:10 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> If cpusets are not in use then we still check a global variable on every
> page allocation. Use jump labels to avoid the overhead.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-06 14:42             ` Vlastimil Babka
@ 2014-05-06 15:12               ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-06 15:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sasha Levin, Linux-MM, Linux-FSDevel, Johannes Weiner, Jan Kara,
	Michal Hocko, Hugh Dickins, Linux Kernel

On Tue, May 06, 2014 at 04:42:18PM +0200, Vlastimil Babka wrote:
> >>>+unsigned long get_pageblock_flags_mask(struct page *page,
> >>>+					unsigned long end_bitidx,
> >>>+					unsigned long nr_flag_bits,
> >>>+					unsigned long mask)
> >>>  {
> >>>  	struct zone *zone;
> >>>  	unsigned long *bitmap;
> >>>-	unsigned long pfn, bitidx;
> >>>-	unsigned long flags = 0;
> >>>-	unsigned long value = 1;
> >>>+	unsigned long pfn, bitidx, word_bitidx;
> >>>+	unsigned long word;
> >>>
> >>>  	zone = page_zone(page);
> >>>  	pfn = page_to_pfn(page);
> >>>  	bitmap = get_pageblock_bitmap(zone, pfn);
> >>>  	bitidx = pfn_to_bitidx(zone, pfn);
> >>>+	word_bitidx = bitidx / BITS_PER_LONG;
> >>>+	bitidx &= (BITS_PER_LONG-1);
> >>>
> >>>-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
> >>>-		if (test_bit(bitidx + start_bitidx, bitmap))
> >>>-			flags |= value;
> >>>-
> >>>-	return flags;
> >>>+	word = bitmap[word_bitidx];
> >>
> >>I wonder if on some architecture this may result in inconsistent
> >>word when racing with set(), i.e. cmpxchg? We need consistency at
> >>least on the granularity of byte to prevent the problem with bogus
> >>migratetype values being read.
> >>
> >
> >The number of bits align on the byte boundary so I do not think there is
> >a problem there. There is a BUILD_BUG_ON check in set_pageblock_flags_mask
> >in case this changes so it can be revisited if necessary.
> 
> I was wondering about hardware guarantees in that case (e.g.
> consistency at least on the granularity of byte when a simple memory
> read races with write) but after some discussion in the office I
> understand that hardware without such guarantees wouldn't be able to
> run Linux anyway :)
> 
> Still I wonder if ACCESS_ONCE would be safer in the 'word' variable
> assignment to protect against compiler trying to be too smart?
> 

I couldn't see a case in the get path where it would matter. I put an
ACCESS_ONCE in the set path in case the compiler accidentally determined
that old_word was invariant in that loop.
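
For reference, the loop in question must re-read the word on every retry;
without a volatile access the compiler could treat old_word as loop
invariant. Below is a standalone sketch of that pattern, with ACCESS_ONCE
spelled out as the volatile cast it expands to in this era's
include/linux/compiler.h and a GCC __sync builtin standing in for the
kernel's cmpxchg(); it illustrates the idea and is not the patch itself.

#include <stdio.h>

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

static void set_word_bits(unsigned long *word, unsigned long mask,
                          unsigned long flags)
{
        unsigned long old_word, new_word;

        do {
                /* forces a fresh load on each iteration */
                old_word = ACCESS_ONCE(*word);
                new_word = (old_word & ~mask) | flags;
                /* stand-in for cmpxchg(word, old_word, new_word) */
        } while (!__sync_bool_compare_and_swap(word, old_word, new_word));
}

int main(void)
{
        unsigned long word = ~0UL;

        set_word_bits(&word, 0x7UL << 4, 0x5UL << 4); /* bits 4-6 become 101 */
        printf("word: %#lx\n", word);
        return 0;
}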

> Anyway with the nr_flag_bits removed:
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15/17] mm: Do not use unnecessary atomic operations when adding pages to the LRU
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 15:30     ` Vlastimil Babka
  -1 siblings, 0 replies; 113+ messages in thread
From: Vlastimil Babka @ 2014-05-06 15:30 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On 05/01/2014 10:44 AM, Mel Gorman wrote:
> When adding pages to the LRU we clear the active bit unconditionally. As the
> page could be reachable from other paths we cannot use unlocked operations
> without risk of corruption such as a parallel mark_page_accessed. This
> patch test if is necessary to clear the atomic flag before using an atomic

                                           active

> operation. In the unlikely even this races with mark_page_accesssed the
> consequences are simply that the page may be promoted to the active list
> that might have been left on the inactive list before the patch. This is
> a marginal consequence.

Well if this is racy, then even before the patch, mark_page_accessed 
might have come right after ClearPageActive(page) anyway? Or is the 
changelog saying that this change only extended the race window that 
already existed? If yes it could be more explicit, as now it might sound 
as if the race was introduced.

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>   include/linux/swap.h | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index da8a250..395dcab 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -329,13 +329,15 @@ extern void add_page_to_unevictable_list(struct page *page);
>    */
>   static inline void lru_cache_add_anon(struct page *page)
>   {
> -	ClearPageActive(page);
> +	if (PageActive(page))
> +		ClearPageActive(page);
>   	__lru_cache_add(page);
>   }
>
>   static inline void lru_cache_add_file(struct page *page)
>   {
> -	ClearPageActive(page);
> +	if (PageActive(page))
> +		ClearPageActive(page);
>   	__lru_cache_add(page);
>   }
>
>


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15/17] mm: Do not use unnecessary atomic operations when adding pages to the LRU
  2014-05-06 15:30     ` Vlastimil Babka
@ 2014-05-06 15:55       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-06 15:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On Tue, May 06, 2014 at 05:30:53PM +0200, Vlastimil Babka wrote:
> On 05/01/2014 10:44 AM, Mel Gorman wrote:
> >When adding pages to the LRU we clear the active bit unconditionally. As the
> >page could be reachable from other paths we cannot use unlocked operations
> >without risk of corruption such as a parallel mark_page_accessed. This
> >patch test if is necessary to clear the atomic flag before using an atomic
> 
>                                           active
> 

Thanks. Clearly I had atomic on the brain.

> >operation. In the unlikely even this races with mark_page_accesssed the
> >consequences are simply that the page may be promoted to the active list
> >that might have been left on the inactive list before the patch. This is
> >a marginal consequence.
> 
> Well if this is racy, then even before the patch, mark_page_accessed
> might have come right after ClearPageActive(page) anyway?
> Or is the
> changelog saying that this change only extended the race window that
> already existed? If yes it could be more explicit, as now it might
> sound as if the race was introduced.
> 

When adding pages to the LRU we clear the active bit unconditionally. As the
page could be reachable from other paths we cannot use unlocked operations
without risk of corruption such as a parallel mark_page_accessed. This
patch tests if it is necessary to clear the active flag before using an atomic
operation. This potentially opens a tiny race when PageActive is checked
as mark_page_accessed could be called after PageActive was checked. The
race already exists but this patch changes it slightly. The consequence
is that a page that might have been left on the inactive list before the
patch may now be promoted to the active list. It's too tiny a race and
too marginal a consequence to always use atomic operations for.

?
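
Purely as an illustration of the window described in that wording, here is
a small userspace model; PG_active is just bit 0 of a local flags word and
the helpers stand in for the real page-flag operations, so none of the
names below are kernel API.

#include <stdio.h>

static unsigned long page_flags;        /* models page->flags */

static int page_active(void)        { return page_flags & 1UL; }
static void set_page_active(void)   { page_flags |= 1UL; }
static void clear_page_active(void) { page_flags &= ~1UL; }

static void lru_cache_add_model(void)
{
        if (page_active())
                clear_page_active();
        /*
         * The window: if the test above saw the bit clear, a concurrent
         * mark_page_accessed() can still set it before the page reaches
         * the LRU, so the page may be added while active. Before the
         * patch the bit was cleared unconditionally, so only a mark
         * arriving after the clear could do the same.
         */
        /* __lru_cache_add(page) would follow here */
}

int main(void)
{
        set_page_active();
        lru_cache_add_model();
        printf("PG_active after add: %d\n", page_active());
        return 0;
}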

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 04/17] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 16:01     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 16:01 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> There is no need to calculate zone_idx(preferred_zone) multiple times
> or use the pgdat to figure it out.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 05/17] mm: page_alloc: Only check the zone id check if pages are buddies
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 16:48     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 16:48 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> A node/zone index is used to check if pages are compatible for merging
> but this happens unconditionally even if the buddy page is not free. Defer
> the calculation as long as possible. Ideally we would check the zone boundary
> but nodes can overlap.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 06/17] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 17:24     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 17:24 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> Currently it's calculated once per zone in the zonelist.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 07/17] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 17:25     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 17:25 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> ALLOC_NO_WATERMARK is set in a few cases. Always by kswapd, always for
> __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks etc. Each of these cases
> are relatively rare events but the ALLOC_NO_WATERMARK check is an unlikely
> branch in the fast path.  This patch moves the check out of the fast path
> and after it has been determined that the watermarks have not been met. This
> helps the common fast path at the cost of making the slow path slower and
> hitting kswapd with a performance cost. It's a reasonable tradeoff.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 09/17] mm: page_alloc: Reduce number of times page_to_pfn is called
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 18:47     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 18:47 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> In the free path we calculate page_to_pfn multiple times. Reduce that.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 10/17] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 18:48     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 18:48 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> get_pageblock_migratetype() is called during free with IRQs disabled. This
> is unnecessary and disables IRQs for longer than necessary.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 11/17] mm: page_alloc: Use unsigned int for order in more places
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 18:49     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 18:49 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> X86 prefers the use of unsigned types for iterators and there is a
> tendency to mix whether a signed or unsigned type if used for page
> order. This converts a number of sites in mm/page_alloc.c to use
> unsigned int for order where possible.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 12/17] mm: page_alloc: Convert hot/cold parameter and immediate callers to bool
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 18:49     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 18:49 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> cold is a bool, make it one. Make the likely case the "if" part of the
> block instead of the else as according to the optimisation manual this
> is preferred.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13/17] mm: shmem: Avoid atomic operation during shmem_getpage_gfp
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 18:53     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 18:53 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
> before it's even added to the LRU or visible. This is unnecessary as what
> could it possible race against?  Use an unlocked variant.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 14/17] mm: Do not use atomic operations when releasing pages
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 18:54     ` Rik van Riel
  -1 siblings, 0 replies; 113+ messages in thread
From: Rik van Riel @ 2014-05-06 18:54 UTC (permalink / raw)
  To: Mel Gorman, Linux-MM, Linux-FSDevel
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Linux Kernel

On 05/01/2014 04:44 AM, Mel Gorman wrote:
> There should be no references to the page any more and a parallel mark
> should not be reordered against us. Use the non-locked variant to clear
> the page active flag.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
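
By way of example, the shape of the change is roughly the following (a
sketch; the surrounding calls are illustrative rather than the real
release_pages() code):

	if (put_page_testzero(page)) {
		/*
		 * The last reference is gone, so no other CPU can be
		 * looking at the page: the non-atomic clear avoids a
		 * locked RMW on the flags word.
		 */
		__ClearPageActive(page);	/* was: ClearPageActive(page) */
		free_hot_cold_page(page, false);
	}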

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 03/17] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 20:23     ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2014-05-06 20:23 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:44:34AM +0100, Mel Gorman wrote:
> If cpusets are not in use then we still check a global variable on every
> page allocation. Use jump labels to avoid the overhead.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/cpuset.h | 31 +++++++++++++++++++++++++++++++
>  kernel/cpuset.c        |  8 ++++++--
>  mm/page_alloc.c        |  3 ++-
>  3 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index b19d3dc..2b89e07 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -17,6 +17,35 @@
>  
>  extern int number_of_cpusets;	/* How many cpusets are defined in system? */
>  
> +#ifdef HAVE_JUMP_LABEL
> +extern struct static_key cpusets_enabled_key;
> +static inline bool cpusets_enabled(void)
> +{
> +	return static_key_false(&cpusets_enabled_key);
> +}
> +#else
> +static inline bool cpusets_enabled(void)
> +{
> +	return number_of_cpusets > 1;
> +}
> +#endif
> +
> +static inline void cpuset_inc(void)
> +{
> +	number_of_cpusets++;
> +#ifdef HAVE_JUMP_LABEL
> +	static_key_slow_inc(&cpusets_enabled_key);
> +#endif
> +}
> +
> +static inline void cpuset_dec(void)
> +{
> +	number_of_cpusets--;
> +#ifdef HAVE_JUMP_LABEL
> +	static_key_slow_dec(&cpusets_enabled_key);
> +#endif
> +}

Why keep both the HAVE_JUMP_LABEL ifdefs and number_of_cpusets? When
!HAVE_JUMP_LABEL the static_key simply reverts to an atomic_t and
static_key_false() becomes:

 return unlikely(atomic_read(&key->enabled) > 0);

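For reference, the !HAVE_JUMP_LABEL fallback being referred to is roughly
the following (paraphrased from include/linux/jump_label.h of that era;
treat the details as approximate):

	/* Without jump labels a static_key is just an atomic counter */
	struct static_key {
		atomic_t enabled;
	};

	static __always_inline bool static_key_false(struct static_key *key)
	{
		if (unlikely(atomic_read(&key->enabled) > 0))
			return true;
		return false;
	}

	static inline void static_key_slow_inc(struct static_key *key)
	{
		atomic_inc(&key->enabled);
	}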

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 17/17] mm: filemap: Avoid unnecessary barriers and waitqueue lookup in unlock_page fastpath
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 20:30     ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2014-05-06 20:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:44:48AM +0100, Mel Gorman wrote:
> +/*
> + * If PageWaiters was found to be set at unlock time, __wake_page_waiters
> + * should be called to actually perform the wakeup of waiters.
> + */
> +static inline void __wake_page_waiters(struct page *page)
> +{
> +	ClearPageWaiters(page);

-ENOCOMMENT

barriers should always come with a comment that explains the memory
ordering and references the pairing barrier.

Also, FWIW, there's a mass rename queued for 3.16 that'll make this:

  smp_mb__after_atomic();

but for now the old names are still provided with a __deprecated tag on
them, so no real harm.

> +	smp_mb__after_clear_bit();
> +	wake_up_page(page, PG_locked);
> +}
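
For illustration, the sort of comment being asked for might read something
like the following; the pairing described is an assumption about how the
PG_waiters scheme is meant to work, not Mel's eventual wording:

	static inline void __wake_page_waiters(struct page *page)
	{
		ClearPageWaiters(page);
		/*
		 * Assumed pairing: make the PG_waiters/PG_locked clears
		 * visible before the waitqueue is inspected, pairing with
		 * the barrier implied by prepare_to_wait() and the flag
		 * tests on the waiter side.
		 */
		smp_mb__after_clear_bit();
		wake_up_page(page, PG_locked);
	}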

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-01  8:44   ` Mel Gorman
@ 2014-05-06 20:34     ` Peter Zijlstra
  -1 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2014-05-06 20:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Thu, May 01, 2014 at 09:44:39AM +0100, Mel Gorman wrote:
> +void set_pfnblock_flags_group(struct page *page, unsigned long flags,
> +					unsigned long end_bitidx,
> +					unsigned long nr_flag_bits,
> +					unsigned long mask)
>  {
>  	struct zone *zone;
>  	unsigned long *bitmap;
> +	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long old_word, new_word;
> +
> +	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
>  
>  	zone = page_zone(page);
>  	pfn = page_to_pfn(page);
>  	bitmap = get_pageblock_bitmap(zone, pfn);
>  	bitidx = pfn_to_bitidx(zone, pfn);
> +	word_bitidx = bitidx / BITS_PER_LONG;
> +	bitidx &= (BITS_PER_LONG-1);
> +
>  	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
>  
> +	bitidx += end_bitidx;
> +	mask <<= (BITS_PER_LONG - bitidx - 1);
> +	flags <<= (BITS_PER_LONG - bitidx - 1);
> +
> +	do {
> +		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
> +		new_word = (old_word & ~mask) | flags;
> +	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);
>  }

You could write it like:

	word = ACCESS_ONCE(bitmap[word_bitidx]);
	for (;;) {
		old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags);
		if (word == old_word)
			break;
		word = old_word;
	}

It gives a slightly tighter loop because the word is not re-read on every
iteration; cmpxchg() already returns the current value.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 03/17] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-06 20:23     ` Peter Zijlstra
@ 2014-05-06 22:21       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-06 22:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Tue, May 06, 2014 at 10:23:50PM +0200, Peter Zijlstra wrote:
> On Thu, May 01, 2014 at 09:44:34AM +0100, Mel Gorman wrote:
> > If cpusets are not in use then we still check a global variable on every
> > page allocation. Use jump labels to avoid the overhead.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  include/linux/cpuset.h | 31 +++++++++++++++++++++++++++++++
> >  kernel/cpuset.c        |  8 ++++++--
> >  mm/page_alloc.c        |  3 ++-
> >  3 files changed, 39 insertions(+), 3 deletions(-)
> > 
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index b19d3dc..2b89e07 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -17,6 +17,35 @@
> >  
> >  extern int number_of_cpusets;	/* How many cpusets are defined in system? */
> >  
> > +#ifdef HAVE_JUMP_LABEL
> > +extern struct static_key cpusets_enabled_key;
> > +static inline bool cpusets_enabled(void)
> > +{
> > +	return static_key_false(&cpusets_enabled_key);
> > +}
> > +#else
> > +static inline bool cpusets_enabled(void)
> > +{
> > +	return number_of_cpusets > 1;
> > +}
> > +#endif
> > +
> > +static inline void cpuset_inc(void)
> > +{
> > +	number_of_cpusets++;
> > +#ifdef HAVE_JUMP_LABEL
> > +	static_key_slow_inc(&cpusets_enabled_key);
> > +#endif
> > +}
> > +
> > +static inline void cpuset_dec(void)
> > +{
> > +	number_of_cpusets--;
> > +#ifdef HAVE_JUMP_LABEL
> > +	static_key_slow_dec(&cpusets_enabled_key);
> > +#endif
> > +}
> 
> Why the HAVE_JUMP_LABEL and number_of_cpusets thing? When
> !HAVE_JUMP_LABEL the static_key thing reverts to an atomic_t and
> static_key_false() becomes:
> 

Because number_of_cpusets is used to size a kmalloc(). I could potentially
reach into the internals of static keys and use the value of key->enabled
directly, but that felt like abuse of the API.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-06 20:34     ` Peter Zijlstra
@ 2014-05-06 22:24       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-06 22:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Tue, May 06, 2014 at 10:34:49PM +0200, Peter Zijlstra wrote:
> On Thu, May 01, 2014 at 09:44:39AM +0100, Mel Gorman wrote:
> > +void set_pfnblock_flags_group(struct page *page, unsigned long flags,
> > +					unsigned long end_bitidx,
> > +					unsigned long nr_flag_bits,
> > +					unsigned long mask)
> >  {
> >  	struct zone *zone;
> >  	unsigned long *bitmap;
> > +	unsigned long pfn, bitidx, word_bitidx;
> > +	unsigned long old_word, new_word;
> > +
> > +	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
> >  
> >  	zone = page_zone(page);
> >  	pfn = page_to_pfn(page);
> >  	bitmap = get_pageblock_bitmap(zone, pfn);
> >  	bitidx = pfn_to_bitidx(zone, pfn);
> > +	word_bitidx = bitidx / BITS_PER_LONG;
> > +	bitidx &= (BITS_PER_LONG-1);
> > +
> >  	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
> >  
> > +	bitidx += end_bitidx;
> > +	mask <<= (BITS_PER_LONG - bitidx - 1);
> > +	flags <<= (BITS_PER_LONG - bitidx - 1);
> > +
> > +	do {
> > +		old_word = ACCESS_ONCE(bitmap[word_bitidx]);
> > +		new_word = (old_word & ~mask) | flags;
> > +	} while (cmpxchg(&bitmap[word_bitidx], old_word, new_word) != old_word);
> >  }
> 
> You could write it like:
> 
> 	word = ACCESS_ONCE(bitmap[word_bitidx]);
> 	for (;;) {
> 		old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags);
> 		if (word == old_word)
> 			break;
> 		word = old_word;
> 	}
> 
> It gives a slightly tighter loop because the word is not re-read on every
> iteration; cmpxchg() already returns the current value.

Thanks, I'll use that.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 17/17] mm: filemap: Avoid unnecessary barriers and waitqueue lookup in unlock_page fastpath
  2014-05-05 10:50     ` Jan Kara
@ 2014-05-07  9:03       ` Mel Gorman
  -1 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-07  9:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Linux Kernel

On Mon, May 05, 2014 at 12:50:54PM +0200, Jan Kara wrote:
> On Thu 01-05-14 09:44:48, Mel Gorman wrote:
> > From: Nick Piggin <npiggin@suse.de>
> > 
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_locked and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> > 
> > This adds a few branches to the fast path but avoids bouncing a dirty
> > cache line between CPUs. 32-bit machines always take the slow path but the
> > primary motivation for this patch is large machines so I do not think that
> > is a concern.
> ...
> >  /* 
> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 7d50f79..fb83fe0 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -304,8 +304,7 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
> >  		= container_of(wait, struct wait_bit_queue, wait);
> >  
> >  	if (wait_bit->key.flags != key->flags ||
> > -			wait_bit->key.bit_nr != key->bit_nr ||
> > -			test_bit(key->bit_nr, key->flags))
> > +			wait_bit->key.bit_nr != key->bit_nr)
> >  		return 0;
> >  	else
> >  		return autoremove_wake_function(wait, mode, sync, key);
>   This change seems really unrelated? And it would deserve a comment of
> its own, I'd think, so maybe split it out into a separate patch?
> 

Without it, processes can sleep forever on the lock bit and hang due to
races between when PG_waiters is set and when it is cleared. I'll
investigate whether this can be done a better way.

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c60ed0f..93e4385 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > +int  __wait_on_page_locked_killable(struct page *page)
> > +{
> > +	int ret = 0;
> > +	wait_queue_head_t *wq = page_waitqueue(page);
> > +	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> > +
> > +	if (!test_bit(PG_locked, &page->flags))
> > +		return 0;
> > +	do {
> > +		prepare_to_wait(wq, &wait.wait, TASK_KILLABLE);
> > +		if (!PageWaiters(page))
> > +			SetPageWaiters(page);
> > +		if (likely(PageLocked(page)))
> > +			ret = sleep_on_page_killable(page);
> > +		finish_wait(wq, &wait.wait);
> > +	} while (PageLocked(page) && !ret);
>   So I'm somewhat wondering why this is the only page-waiting variant
> that does finish_wait() inside the loop. Everyone else does it outside
> the while loop, which seems sufficient to me even in this case...
> 

No reason. The finish_wait() can always be outside the loop. I'll fix it up.
Thanks.
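
For what it's worth, moving it out as suggested would look roughly like
this (a sketch based on the hunk quoted above, not the actual respin):

	int __wait_on_page_locked_killable(struct page *page)
	{
		int ret = 0;
		wait_queue_head_t *wq = page_waitqueue(page);
		DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

		if (!test_bit(PG_locked, &page->flags))
			return 0;
		do {
			prepare_to_wait(wq, &wait.wait, TASK_KILLABLE);
			if (!PageWaiters(page))
				SetPageWaiters(page);
			if (likely(PageLocked(page)))
				ret = sleep_on_page_killable(page);
		} while (PageLocked(page) && !ret);
		finish_wait(wq, &wait.wait);	/* once, after the loop */
		return ret;
	}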

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 03/17] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-06 22:21       ` Mel Gorman
  (?)
@ 2014-05-07  9:04       ` Peter Zijlstra
  2014-05-07  9:43           ` Mel Gorman
  -1 siblings, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2014-05-07  9:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel


On Tue, May 06, 2014 at 11:21:18PM +0100, Mel Gorman wrote:
> On Tue, May 06, 2014 at 10:23:50PM +0200, Peter Zijlstra wrote:

> > Why the HAVE_JUMP_LABEL and number_of_cpusets thing? When
> > !HAVE_JUMP_LABEL the static_key thing reverts to an atomic_t and
> > static_key_false() becomes:
> > 
> 
> Because number_of_cpusets is used to size a kmalloc(). Potentially I could
> abuse the internals of static keys and use the value of key->enabled but
> that felt like abuse of the API.

But are those ifdefs worth the saving of 4 bytes of .data?

That said, I see no real problem adding static_key_count(). Static keys
(jump labels back then) were specifically designed to include the count
and act as a 'ref/usage' counter. It's just that so far everybody has
only cared about the boolean 'are there users' question, but there is
no reason not to also return the full count.

Maybe I should also do a patch that renames the static_key::enabled
field to static_key::count to better reflect this.

---
 include/linux/jump_label.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 5c1dfb2a9e73..1a48d16622aa 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -197,4 +197,9 @@ static inline bool static_key_enabled(struct static_key *key)
 	return (atomic_read(&key->enabled) > 0);
 }
 
+static inline int static_key_count(struct static_key *key)
+{
+	return atomic_read(&key->enabled);
+}
+
 #endif	/* _LINUX_JUMP_LABEL_H */
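
With a helper like that, the cpuset side could presumably lose both the
ifdefs and the separate counter. A rough sketch of what the header might
become (an illustration only, not the respin itself; the nr_cpusets()
name is made up for the sketch):

	extern struct static_key cpusets_enabled_key;

	static inline bool cpusets_enabled(void)
	{
		return static_key_false(&cpusets_enabled_key);
	}

	static inline int nr_cpusets(void)
	{
		/* the root cpuset is not counted in the key, hence the +1 */
		return static_key_count(&cpusets_enabled_key) + 1;
	}

	static inline void cpuset_inc(void)
	{
		static_key_slow_inc(&cpusets_enabled_key);
	}

	static inline void cpuset_dec(void)
	{
		static_key_slow_dec(&cpusets_enabled_key);
	}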


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH 03/17] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-07  9:04       ` Peter Zijlstra
@ 2014-05-07  9:43           ` Mel Gorman
  0 siblings, 0 replies; 113+ messages in thread
From: Mel Gorman @ 2014-05-07  9:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, Linux-FSDevel, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Linux Kernel

On Wed, May 07, 2014 at 11:04:21AM +0200, Peter Zijlstra wrote:
> On Tue, May 06, 2014 at 11:21:18PM +0100, Mel Gorman wrote:
> > On Tue, May 06, 2014 at 10:23:50PM +0200, Peter Zijlstra wrote:
> 
> > > Why the HAVE_JUMP_LABEL and number_of_cpusets thing? When
> > > !HAVE_JUMP_LABEL the static_key thing reverts to an atomic_t and
> > > static_key_false() becomes:
> > > 
> > 
> > Because number_of_cpusets is used to size a kmalloc(). Potentially I could
> > abuse the internals of static keys and use the value of key->enabled but
> > that felt like abuse of the API.
> 
> But are those ifdefs worth the saving of 4 bytes of .data?
> 
> That said, I see no real problem adding static_key_count().

I thought it would be considered API abuse, as I had always viewed the
labels as an enabled/disabled thing, with the ref count being an internal
implementation detail. I'll take this approach.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 113+ messages in thread

end of thread, other threads:[~2014-05-07  9:43 UTC | newest]

Thread overview: 113+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-01  8:44 [PATCH 00/17] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations Mel Gorman
2014-05-01  8:44 ` Mel Gorman
2014-05-01  8:44 ` [PATCH 01/17] mm: page_alloc: Do not update zlc unless the zlc is active Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-01 13:25   ` Johannes Weiner
2014-05-01 13:25     ` Johannes Weiner
2014-05-06 15:04   ` Rik van Riel
2014-05-06 15:04     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 02/17] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full" Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 15:09   ` Rik van Riel
2014-05-06 15:09     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 03/17] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 15:10   ` Rik van Riel
2014-05-06 15:10     ` Rik van Riel
2014-05-06 20:23   ` Peter Zijlstra
2014-05-06 20:23     ` Peter Zijlstra
2014-05-06 22:21     ` Mel Gorman
2014-05-06 22:21       ` Mel Gorman
2014-05-07  9:04       ` Peter Zijlstra
2014-05-07  9:43         ` Mel Gorman
2014-05-07  9:43           ` Mel Gorman
2014-05-01  8:44 ` [PATCH 04/17] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 16:01   ` Rik van Riel
2014-05-06 16:01     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 05/17] mm: page_alloc: Only check the zone id check if pages are buddies Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 16:48   ` Rik van Riel
2014-05-06 16:48     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 06/17] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 17:24   ` Rik van Riel
2014-05-06 17:24     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 07/17] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 17:25   ` Rik van Riel
2014-05-06 17:25     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-02 22:34   ` Sasha Levin
2014-05-02 22:34     ` Sasha Levin
2014-05-04 13:14     ` Mel Gorman
2014-05-04 13:14       ` Mel Gorman
2014-05-05 12:40       ` Vlastimil Babka
2014-05-05 12:40         ` Vlastimil Babka
2014-05-06  9:13         ` Mel Gorman
2014-05-06  9:13           ` Mel Gorman
2014-05-06 14:42           ` Vlastimil Babka
2014-05-06 14:42             ` Vlastimil Babka
2014-05-06 15:12             ` Mel Gorman
2014-05-06 15:12               ` Mel Gorman
2014-05-06 20:34   ` Peter Zijlstra
2014-05-06 20:34     ` Peter Zijlstra
2014-05-06 22:24     ` Mel Gorman
2014-05-06 22:24       ` Mel Gorman
2014-05-01  8:44 ` [PATCH 09/17] mm: page_alloc: Reduce number of times page_to_pfn is called Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 18:47   ` Rik van Riel
2014-05-06 18:47     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 10/17] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 18:48   ` Rik van Riel
2014-05-06 18:48     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 11/17] mm: page_alloc: Use unsigned int for order in more places Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-01 14:35   ` Dave Hansen
2014-05-01 14:35     ` Dave Hansen
2014-05-01 15:11     ` Mel Gorman
2014-05-01 15:11       ` Mel Gorman
2014-05-01 15:38       ` Dave Hansen
2014-05-01 15:38         ` Dave Hansen
2014-05-06 18:49   ` Rik van Riel
2014-05-06 18:49     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 12/17] mm: page_alloc: Convert hot/cold parameter and immediate callers to bool Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 18:49   ` Rik van Riel
2014-05-06 18:49     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 13/17] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-06 18:53   ` Rik van Riel
2014-05-06 18:53     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 14/17] mm: Do not use atomic operations when releasing pages Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-01 13:29   ` Johannes Weiner
2014-05-01 13:29     ` Johannes Weiner
2014-05-01 13:39     ` Mel Gorman
2014-05-01 13:39       ` Mel Gorman
2014-05-01 13:47       ` Johannes Weiner
2014-05-01 13:47         ` Johannes Weiner
2014-05-06 18:54   ` Rik van Riel
2014-05-06 18:54     ` Rik van Riel
2014-05-01  8:44 ` [PATCH 15/17] mm: Do not use unnecessary atomic operations when adding pages to the LRU Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-01 13:33   ` Johannes Weiner
2014-05-01 13:33     ` Johannes Weiner
2014-05-01 13:40     ` Mel Gorman
2014-05-01 13:40       ` Mel Gorman
2014-05-06 15:30   ` Vlastimil Babka
2014-05-06 15:30     ` Vlastimil Babka
2014-05-06 15:55     ` Mel Gorman
2014-05-06 15:55       ` Mel Gorman
2014-05-01  8:44 ` [PATCH 16/17] mm: Non-atomically mark page accessed during page cache allocation where possible Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-01  8:44 ` [PATCH 17/17] mm: filemap: Avoid unnecessary barriers and waitqueue lookup in unlock_page fastpath Mel Gorman
2014-05-01  8:44   ` Mel Gorman
2014-05-05 10:50   ` Jan Kara
2014-05-05 10:50     ` Jan Kara
2014-05-07  9:03     ` Mel Gorman
2014-05-07  9:03       ` Mel Gorman
2014-05-06 20:30   ` Peter Zijlstra
2014-05-06 20:30     ` Peter Zijlstra
