* [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

Changelog since V2
o Fewer atomic operations in buffer discards				(mgorman)
o Remove number_of_cpusets and use ref count in jump labels		(peterz)
o Optimise set loop for pageblock flags further				(peterz)
o Remove unnecessary parameters when setting pageblock flags		(vbabka)
o Rework how PG_waiters are set/cleared to avoid changing wait.c	(mgorman)

I was investigating a performance bug that looked like a regression in
dd to tmpfs. The bulk of the problem turned out to be a Kconfig
difference, but it got me looking at the unnecessary overhead in tmpfs,
mark_page_accessed and parts of the allocator. This series is the result.

The patches themselves have details of the performance results but here
are a few showing the impact of the whole series. This is the result of
dd'ing to a file multiple times on tmpfs

sync DD to tmpfs
Throughput           3.15.0-rc4            3.15.0-rc4
                        vanilla         fullseries-v3
Min         4096.0000 (  0.00%)   4300.8000 (  5.00%)
Mean        4785.4933 (  0.00%)   5003.9467 (  4.56%)
TrimMean    4812.8000 (  0.00%)   5028.5714 (  4.48%)
Stddev       147.0509 (  0.00%)    191.9981 ( 30.57%)
Max         5017.6000 (  0.00%)   5324.8000 (  6.12%)

sync DD to tmpfs
Elapsed Time                3.15.0-rc4            3.15.0-rc4
                               vanilla         fullseries-v3
Min      elapsed      0.4200 (  0.00%)      0.3900 (  7.14%)
Mean     elapsed      0.4947 (  0.00%)      0.4527 (  8.49%)
TrimMean elapsed      0.4968 (  0.00%)      0.4539 (  8.63%)
Stddev   elapsed      0.0255 (  0.00%)      0.0340 (-33.02%)
Max      elapsed      0.5200 (  0.00%)      0.4800 (  7.69%)

async DD to tmpfs
Elapsed Time                3.15.0-rc4            3.15.0-rc4
                               vanilla         fullseries-v3
TrimMean elapsed      0.4796 (  0.00%)      0.4179 ( 12.88%)
Stddev   elapsed      0.0353 (  0.00%)      0.0379 ( -7.23%)
Max      elapsed      0.5100 (  0.00%)      0.4800 (  5.88%)

sync DD to ext4
Throughput           3.15.0-rc4            3.15.0-rc4
                        vanilla         fullseries-v3
Min          113.0000 (  0.00%)    117.0000 (  3.54%)
Mean         116.3000 (  0.00%)    119.6667 (  2.89%)
TrimMean     116.2857 (  0.00%)    119.5714 (  2.83%)
Stddev         1.6961 (  0.00%)      1.1643 (-31.35%)
Max          120.0000 (  0.00%)    122.0000 (  1.67%)

sync DD to ext4
Elapsed time                3.15.0-rc4            3.15.0-rc4
                               vanilla         fullseries-v3
Min      elapsed     13.9500 (  0.00%)     13.6900 (  1.86%)
Mean     elapsed     14.4253 (  0.00%)     14.0010 (  2.94%)
TrimMean elapsed     14.4321 (  0.00%)     14.0161 (  2.88%)
Stddev   elapsed      0.2047 (  0.00%)      0.1423 ( 30.46%)
Max      elapsed     14.8300 (  0.00%)     14.3100 (  3.51%)

async DD to ext4 
Elapsed time                3.15.0-rc4            3.15.0-rc4
                               vanilla         fullseries-v3
Min      elapsed      0.7900 (  0.00%)      0.7800 (  1.27%)
Mean     elapsed     12.4023 (  0.00%)     12.2957 (  0.86%)
TrimMean elapsed     13.2036 (  0.00%)     13.0918 (  0.85%)
Stddev   elapsed      3.3286 (  0.00%)      2.9842 ( 10.35%)
Max      elapsed     18.6000 (  0.00%)     13.4300 ( 27.80%)



This table shows the latency in usecs of accessing ext4-backed
mappings of various sizes

lat_mmap
                       3.15.0-rc4            3.15.0-rc4
                          vanilla         fullseries-v3
Procs 107M     564.0000 (  0.00%)    546.0000 (  3.19%)
Procs 214M    1123.0000 (  0.00%)   1090.0000 (  2.94%)
Procs 322M    1636.0000 (  0.00%)   1395.0000 ( 14.73%)
Procs 429M    2076.0000 (  0.00%)   2051.0000 (  1.20%)
Procs 536M    2518.0000 (  0.00%)   2482.0000 (  1.43%)
Procs 644M    3008.0000 (  0.00%)   2978.0000 (  1.00%)
Procs 751M    3506.0000 (  0.00%)   3450.0000 (  1.60%)
Procs 859M    3988.0000 (  0.00%)   3756.0000 (  5.82%)
Procs 966M    4544.0000 (  0.00%)   4310.0000 (  5.15%)
Procs 1073M   4960.0000 (  0.00%)   4928.0000 (  0.65%)
Procs 1181M   5342.0000 (  0.00%)   5144.0000 (  3.71%)
Procs 1288M   5573.0000 (  0.00%)   5427.0000 (  2.62%)
Procs 1395M   5777.0000 (  0.00%)   6056.0000 ( -4.83%)
Procs 1503M   6141.0000 (  0.00%)   5963.0000 (  2.90%)
Procs 1610M   6689.0000 (  0.00%)   6331.0000 (  5.35%)
Procs 1717M   8839.0000 (  0.00%)   6807.0000 ( 22.99%)
Procs 1825M   8399.0000 (  0.00%)   9062.0000 ( -7.89%)
Procs 1932M   7871.0000 (  0.00%)   8778.0000 (-11.52%)
Procs 2040M   8235.0000 (  0.00%)   8081.0000 (  1.87%)
Procs 2147M   8861.0000 (  0.00%)   8337.0000 (  5.91%)

In general, the system CPU overhead is lower with the series applied.

 arch/tile/mm/homecache.c        |   2 +-
 fs/btrfs/extent_io.c            |  11 +-
 fs/btrfs/file.c                 |   5 +-
 fs/buffer.c                     |  21 ++-
 fs/ext4/mballoc.c               |  14 +-
 fs/f2fs/checkpoint.c            |   3 -
 fs/f2fs/node.c                  |   2 -
 fs/fuse/dev.c                   |   2 +-
 fs/fuse/file.c                  |   2 -
 fs/gfs2/aops.c                  |   1 -
 fs/gfs2/meta_io.c               |   4 +-
 fs/ntfs/attrib.c                |   1 -
 fs/ntfs/file.c                  |   1 -
 include/linux/buffer_head.h     |   5 +
 include/linux/cpuset.h          |  46 +++++
 include/linux/gfp.h             |   4 +-
 include/linux/jump_label.h      |  20 ++-
 include/linux/mmzone.h          |  21 ++-
 include/linux/page-flags.h      |  20 +++
 include/linux/pageblock-flags.h |  30 +++-
 include/linux/pagemap.h         | 115 +++++++++++-
 include/linux/swap.h            |   9 +-
 kernel/cpuset.c                 |  10 +-
 mm/filemap.c                    | 380 +++++++++++++++++++++++++---------------
 mm/page_alloc.c                 | 229 ++++++++++++++----------
 mm/shmem.c                      |   8 +-
 mm/swap.c                       |  27 ++-
 mm/swap_state.c                 |   2 +-
 mm/vmscan.c                     |   9 +-
 29 files changed, 686 insertions(+), 318 deletions(-)

-- 
1.8.4.5



* [PATCH 01/19] mm: page_alloc: Do not update zlc unless the zlc is active
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

The zlc (zonelist cache) is used on NUMA machines to quickly skip over
zones that are full. However, it is always updated, even for the first
zone scanned, when the zlc might not even be active. As it is a write to
a bitmap that potentially bounces a cache line, it is deceptively
expensive, even though most machines will not notice. Only update the
zlc if it is active.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5dba293..f8b80c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2044,7 +2044,7 @@ try_this_zone:
 		if (page)
 			break;
 this_zone_full:
-		if (IS_ENABLED(CONFIG_NUMA))
+		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
 			zlc_mark_zone_full(zonelist, z);
 	}
 
-- 
1.8.4.5



* [PATCH 02/19] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full"
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

If a zone cannot be used for a dirty page then it gets marked "full".
This is cached in the zlc, and the zone may later be skipped by
allocation requests that have nothing to do with dirty-page placement.
Skip the zone for the current allocation instead of marking it full.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8b80c3..5c559e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1976,7 +1976,7 @@ zonelist_scan:
 		 */
 		if ((alloc_flags & ALLOC_WMARK_LOW) &&
 		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
-			goto this_zone_full;
+			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_ok(zone, order, mark,
-- 
1.8.4.5



* [PATCH 03/19] jump_label: Expose the reference count
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

This patch exposes the jump_label reference count in preparation for the
next patch. The cpusets code cares both about whether the jump label is
enabled and about how many cpusets are currently in use.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/jump_label.h | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/jump_label.h b/include/linux/jump_label.h
index 5c1dfb2..784304b 100644
--- a/include/linux/jump_label.h
+++ b/include/linux/jump_label.h
@@ -69,6 +69,10 @@ struct static_key {
 
 # include <asm/jump_label.h>
 # define HAVE_JUMP_LABEL
+#else
+struct static_key {
+	atomic_t enabled;
+};
 #endif	/* CC_HAVE_ASM_GOTO && CONFIG_JUMP_LABEL */
 
 enum jump_label_type {
@@ -79,6 +83,12 @@ enum jump_label_type {
 struct module;
 
 #include <linux/atomic.h>
+
+static inline int static_key_count(struct static_key *key)
+{
+	return atomic_read(&key->enabled);
+}
+
 #ifdef HAVE_JUMP_LABEL
 
 #define JUMP_LABEL_TYPE_FALSE_BRANCH	0UL
@@ -134,10 +144,6 @@ extern void jump_label_apply_nops(struct module *mod);
 
 #else  /* !HAVE_JUMP_LABEL */
 
-struct static_key {
-	atomic_t enabled;
-};
-
 static __always_inline void jump_label_init(void)
 {
 	static_key_initialized = true;
@@ -145,14 +151,14 @@ static __always_inline void jump_label_init(void)
 
 static __always_inline bool static_key_false(struct static_key *key)
 {
-	if (unlikely(atomic_read(&key->enabled) > 0))
+	if (unlikely(static_key_count(key) > 0))
 		return true;
 	return false;
 }
 
 static __always_inline bool static_key_true(struct static_key *key)
 {
-	if (likely(atomic_read(&key->enabled) > 0))
+	if (likely(static_key_count(key) > 0))
 		return true;
 	return false;
 }
@@ -194,7 +200,7 @@ static inline int jump_label_apply_nops(struct module *mod)
 
 static inline bool static_key_enabled(struct static_key *key)
 {
-	return (atomic_read(&key->enabled) > 0);
+	return static_key_count(key) > 0;
 }
 
 #endif	/* _LINUX_JUMP_LABEL_H */
-- 
1.8.4.5



* [PATCH 04/19] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

Even if cpusets are not in use, a global variable is still checked on
every page allocation. Use jump labels to avoid that overhead when no
cpusets have been created.
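
As a reminder of how the static key pattern works (an illustrative
sketch with made-up names, not code from this series): the fast-path
test compiles to a patched no-op branch when jump labels are available,
and falls back to an atomic read of the key's count otherwise.

    #include <linux/jump_label.h>

    static struct static_key feature_key = STATIC_KEY_INIT_FALSE;

    static inline bool feature_active(void)
    {
            /* no-op branch until the key is enabled */
            return static_key_false(&feature_key);
    }

    static void feature_register(void)
    {
            /* rare, slow operation: patches the branch in */
            static_key_slow_inc(&feature_key);
    }

    static void feature_unregister(void)
    {
            /* patches the branch back out when the last user goes away */
            static_key_slow_dec(&feature_key);
    }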

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/cpuset.h | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/cpuset.c        | 10 +++++++---
 mm/page_alloc.c        |  3 ++-
 3 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index b19d3dc..561cdb1 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -15,8 +15,52 @@
 
 #ifdef CONFIG_CPUSETS
 
+#ifdef HAVE_JUMP_LABEL
+extern struct static_key cpusets_enabled_key;
+static inline bool cpusets_enabled(void)
+{
+	return static_key_false(&cpusets_enabled_key);
+}
+
+/* jump label reference count + the top-level cpuset */
+#define number_of_cpusets (static_key_count(&cpusets_enabled_key) + 1)
+
+static inline void cpuset_inc(void)
+{
+	static_key_slow_inc(&cpusets_enabled_key);
+}
+
+static inline void cpuset_dec(void)
+{
+	static_key_slow_dec(&cpusets_enabled_key);
+}
+
+static inline void cpuset_init_count(void) { }
+
+#else
 extern int number_of_cpusets;	/* How many cpusets are defined in system? */
 
+static inline bool cpusets_enabled(void)
+{
+	return number_of_cpusets > 1;
+}
+
+static inline void cpuset_inc(void)
+{
+	number_of_cpusets++;
+}
+
+static inline void cpuset_dec(void)
+{
+	number_of_cpusets--;
+}
+
+static inline void cpuset_init_count(void)
+{
+	number_of_cpusets = 1;
+}
+#endif /* HAVE_JUMP_LABEL */
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_update_active_cpus(bool cpu_online);
@@ -124,6 +168,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 #else /* !CONFIG_CPUSETS */
 
+static inline bool cpusets_enabled(void) { return false; }
+
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3d54c41..d503f26 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -61,12 +61,16 @@
 #include <linux/cgroup.h>
 #include <linux/wait.h>
 
+#ifdef HAVE_JUMP_LABEL
+struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;
+#else
 /*
  * Tracks how many cpusets are currently defined in system.
  * When there is only one cpuset (the root cpuset) we can
  * short circuit some hooks.
  */
 int number_of_cpusets __read_mostly;
+#endif
 
 /* See "Frequency meter" comments, below. */
 
@@ -1888,7 +1892,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
 
-	number_of_cpusets++;
+	cpuset_inc();
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -1939,7 +1943,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
-	number_of_cpusets--;
+	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
 	mutex_unlock(&cpuset_mutex);
@@ -1992,7 +1996,7 @@ int __init cpuset_init(void)
 	if (!alloc_cpumask_var(&cpus_attach, GFP_KERNEL))
 		BUG();
 
-	number_of_cpusets = 1;
+	cpuset_init_count();
 	return 0;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c559e3..cb12b9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1930,7 +1930,8 @@ zonelist_scan:
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
+		if (cpusets_enabled() &&
+			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
-- 
1.8.4.5



* [PATCH 05/19] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

There is no need to calculate zone_idx(preferred_zone) multiple times
or to use the pgdat to figure it out. Calculate classzone_idx once from
the zoneref returned by first_zones_zonelist() and pass it down.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 55 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb12b9a..3b6ae9d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1907,17 +1907,15 @@ static inline void init_zone_allows_reclaim(int nid)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int classzone_idx, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
-	int classzone_idx;
 	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	classzone_idx = zone_idx(preferred_zone);
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2174,7 +2172,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
@@ -2192,7 +2190,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto out;
 
@@ -2227,7 +2225,7 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int classzone_idx, int migratetype, bool sync_migration,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2255,7 +2253,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			compaction_defer_reset(preferred_zone, order, true);
@@ -2287,7 +2285,7 @@ static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, bool sync_migration,
+	int classzone_idx, int migratetype, bool sync_migration,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2328,7 +2326,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress)
+	int classzone_idx, int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	bool drained = false;
@@ -2346,7 +2344,8 @@ retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
-					preferred_zone, migratetype);
+					preferred_zone, classzone_idx,
+					migratetype);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2369,14 +2368,14 @@ static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2477,7 +2476,7 @@ static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -2526,15 +2525,19 @@ restart:
 	 * Find the true preferred zone if the allocation is unconstrained by
 	 * cpusets.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
-		first_zones_zonelist(zonelist, high_zoneidx, NULL,
-					&preferred_zone);
+	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask) {
+		struct zoneref *preferred_zoneref;
+		preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
+				nodemask ? : &cpuset_current_mems_allowed,
+				&preferred_zone);
+		classzone_idx = zonelist_zone_idx(preferred_zoneref);
+	}
 
 rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto got_pg;
 
@@ -2549,7 +2552,7 @@ rebalance:
 
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			goto got_pg;
 		}
@@ -2582,6 +2585,7 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
+					classzone_idx,
 					migratetype, sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
@@ -2605,7 +2609,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress);
+					classzone_idx, migratetype,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -2624,7 +2629,7 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					classzone_idx, migratetype);
 			if (page)
 				goto got_pg;
 
@@ -2667,6 +2672,7 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
+					classzone_idx,
 					migratetype, sync_migration,
 					&contended_compaction,
 					&deferred_compaction,
@@ -2694,11 +2700,13 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
+	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	struct mem_cgroup *memcg = NULL;
+	int classzone_idx;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2728,11 +2736,12 @@ retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx,
+	preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
 				nodemask ? : &cpuset_current_mems_allowed,
 				&preferred_zone);
 	if (!preferred_zone)
 		goto out;
+	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -2742,7 +2751,7 @@ retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
 		 * The first pass makes sure allocations are spread
@@ -2768,7 +2777,7 @@ retry:
 		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
-- 
1.8.4.5



* [PATCH 06/19] mm: page_alloc: Only check the zone id if pages are buddies
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

A node/zone index is used to check whether pages are compatible for
merging, but this happens unconditionally even if the buddy page is not
free. Defer the calculation for as long as possible. Ideally we would
check against the zone boundary, but nodes can overlap.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b6ae9d..8971953 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -508,16 +508,26 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
 	if (!pfn_valid_within(page_to_pfn(buddy)))
 		return 0;
 
-	if (page_zone_id(page) != page_zone_id(buddy))
-		return 0;
-
 	if (page_is_guard(buddy) && page_order(buddy) == order) {
 		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);
+
+		if (page_zone_id(page) != page_zone_id(buddy))
+			return 0;
+
 		return 1;
 	}
 
 	if (PageBuddy(buddy) && page_order(buddy) == order) {
 		VM_BUG_ON_PAGE(page_count(buddy) != 0, buddy);
+
+		/*
+		 * zone check is done late to avoid uselessly
+		 * calculating zone/node ids for pages that could
+		 * never merge.
+		 */
+		if (page_zone_id(page) != page_zone_id(buddy))
+			return 0;
+
 		return 1;
 	}
 	return 0;
-- 
1.8.4.5



* [PATCH 07/19] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

Whether a zone's dirty limits should be considered depends only on the
alloc flags and the gfp mask, yet it is currently recalculated for every
zone in the zonelist. Calculate it once per allocation attempt instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8971953..2e576fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1925,6 +1925,8 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
+	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
+				(gfp_mask & __GFP_WRITE);
 
 zonelist_scan:
 	/*
@@ -1983,8 +1985,7 @@ zonelist_scan:
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if ((alloc_flags & ALLOC_WMARK_LOW) &&
-		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+		if (consider_zone_dirty && !zone_dirty_ok(zone))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
-- 
1.8.4.5



* [PATCH 08/19] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

ALLOC_NO_WATERMARKS is set in a few cases: always by kswapd, always for
__GFP_MEMALLOC, sometimes for swap-over-NFS, for tasks with access to
memory reserves, and so on. Each of these cases is relatively rare, but
the ALLOC_NO_WATERMARKS check is an unlikely branch in the fast path.
This patch moves the check out of the fast path to after it has been
determined that the watermarks have not been met. This helps the common
fast path at the cost of making the slow path slower and imposing a
small cost on kswapd. It's a reasonable tradeoff.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2e576fd..dc123ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1944,9 +1944,6 @@ zonelist_scan:
 			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
-		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
-		if (unlikely(alloc_flags & ALLOC_NO_WATERMARKS))
-			goto try_this_zone;
 		/*
 		 * Distribute pages in proportion to the individual
 		 * zone size to ensure fair page aging.  The zone a
@@ -1993,6 +1990,11 @@ zonelist_scan:
 				       classzone_idx, alloc_flags)) {
 			int ret;
 
+			/* Checked here to keep the fast path fast */
+			BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
+			if (alloc_flags & ALLOC_NO_WATERMARKS)
+				goto try_this_zone;
+
 			if (IS_ENABLED(CONFIG_NUMA) &&
 					!did_zlc_setup && nr_online_nodes > 1) {
 				/*
-- 
1.8.4.5



* [PATCH 09/19] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

The test_bit operations in the get/set pageblock flags helpers are
expensive. This patch reads the bitmap a word at a time and uses shifts
and masks to isolate the bits of interest. Similarly, masks are used to
build an updated copy of the word, and cmpxchg is used to write it back,
retrying if other changes were made to the word in parallel.

In a test running dd onto tmpfs, the overhead of the pageblock-related
functions dropped from 1.27% to 0.5% of the profile.
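
The update is a standard lock-free read-modify-write of a single word.
A minimal userspace sketch of the same idea (the field layout and shift
convention here are simplified and illustrative, not the kernel's):

    #include <stdatomic.h>

    /* read a bit field out of one word of the bitmap */
    static unsigned long get_flags(atomic_ulong *word, unsigned int shift,
                                   unsigned long mask)
    {
            return (atomic_load(word) >> shift) & mask;
    }

    /* replace the field, retrying if another CPU changed the word */
    static void set_flags(atomic_ulong *word, unsigned long flags,
                          unsigned int shift, unsigned long mask)
    {
            unsigned long old = atomic_load(word);
            unsigned long new;

            do {
                    new = (old & ~(mask << shift)) | ((flags & mask) << shift);
            } while (!atomic_compare_exchange_weak(word, &old, new));
    }

Because the compare-and-swap only succeeds when the word is unchanged,
concurrent updates to the other bits packed into the same word are not
lost.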

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h          |  6 ++++-
 include/linux/pageblock-flags.h | 37 ++++++++++++++++++++++++-----
 mm/page_alloc.c                 | 52 +++++++++++++++++++++++++----------------
 3 files changed, 68 insertions(+), 27 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fac5509..835aa3d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -75,9 +75,13 @@ enum {
 
 extern int page_group_by_mobility_disabled;
 
+#define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
+#define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
+
 static inline int get_pageblock_migratetype(struct page *page)
 {
-	return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
+	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
+	return get_pageblock_flags_mask(page, PB_migrate_end, MIGRATETYPE_MASK);
 }
 
 struct free_area {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 2ee8cd2..c08730c 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -30,9 +30,12 @@ enum pageblock_bits {
 	PB_migrate,
 	PB_migrate_end = PB_migrate + 3 - 1,
 			/* 3 bits required for migrate types */
-#ifdef CONFIG_COMPACTION
 	PB_migrate_skip,/* If set the block is skipped by compaction */
-#endif /* CONFIG_COMPACTION */
+
+	/*
+	 * Assume the bits will always align on a word. If this assumption
+	 * changes then get/set pageblock needs updating.
+	 */
 	NR_PAGEBLOCK_BITS
 };
 
@@ -62,11 +65,33 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
+unsigned long get_pageblock_flags_mask(struct page *page,
+				unsigned long end_bitidx,
+				unsigned long mask);
+void set_pageblock_flags_mask(struct page *page,
+				unsigned long flags,
+				unsigned long end_bitidx,
+				unsigned long mask);
+
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx);
-void set_pageblock_flags_group(struct page *page, unsigned long flags,
-					int start_bitidx, int end_bitidx);
+static inline unsigned long get_pageblock_flags_group(struct page *page,
+					int start_bitidx, int end_bitidx)
+{
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+
+	return get_pageblock_flags_mask(page, end_bitidx, mask);
+}
+
+static inline void set_pageblock_flags_group(struct page *page,
+					unsigned long flags,
+					int start_bitidx, int end_bitidx)
+{
+	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
+	unsigned long mask = (1 << nr_flag_bits) - 1;
+
+	set_pageblock_flags_mask(page, flags, end_bitidx, mask);
+}
 
 #ifdef CONFIG_COMPACTION
 #define get_pageblock_skip(page) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc123ff..b438eb7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6032,53 +6032,65 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
+unsigned long get_pageblock_flags_mask(struct page *page,
+					unsigned long end_bitidx,
+					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long flags = 0;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long word;
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (test_bit(bitidx + start_bitidx, bitmap))
-			flags |= value;
-
-	return flags;
+	word = bitmap[word_bitidx];
+	bitidx += end_bitidx;
+	return (word >> (BITS_PER_LONG - bitidx - 1)) & mask;
 }
 
 /**
- * set_pageblock_flags_group - Set the requested group of flags for a pageblock_nr_pages block of pages
+ * set_pageblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
  * @page: The page within the block of interest
  * @start_bitidx: The first bit of interest
  * @end_bitidx: The last bit of interest
  * @flags: The flags to set
  */
-void set_pageblock_flags_group(struct page *page, unsigned long flags,
-					int start_bitidx, int end_bitidx)
+void set_pageblock_flags_mask(struct page *page, unsigned long flags,
+					unsigned long end_bitidx,
+					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx;
-	unsigned long value = 1;
+	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long old_word, word;
+
+	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 
 	zone = page_zone(page);
 	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
+	word_bitidx = bitidx / BITS_PER_LONG;
+	bitidx &= (BITS_PER_LONG-1);
+
 	VM_BUG_ON_PAGE(!zone_spans_pfn(zone, pfn), page);
 
-	for (; start_bitidx <= end_bitidx; start_bitidx++, value <<= 1)
-		if (flags & value)
-			__set_bit(bitidx + start_bitidx, bitmap);
-		else
-			__clear_bit(bitidx + start_bitidx, bitmap);
+	bitidx += end_bitidx;
+	mask <<= (BITS_PER_LONG - bitidx - 1);
+	flags <<= (BITS_PER_LONG - bitidx - 1);
+
+	word = ACCESS_ONCE(bitmap[word_bitidx]);
+	for (;;) {
+		old_word = cmpxchg(&bitmap[word_bitidx], word, (word & ~mask) | flags);
+		if (word == old_word)
+			break;
+		word = old_word;
+	}
 }
 
 /*
-- 
1.8.4.5



* [PATCH 10/19] mm: page_alloc: Reduce number of times page_to_pfn is called
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

In the free path we calculate page_to_pfn() multiple times for the same
page. Calculate it once and pass the pfn down instead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h          |  9 +++++++--
 include/linux/pageblock-flags.h | 33 +++++++++++++--------------------
 mm/page_alloc.c                 | 34 +++++++++++++++++++---------------
 3 files changed, 39 insertions(+), 37 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 835aa3d..bd6f504 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -78,10 +78,15 @@ extern int page_group_by_mobility_disabled;
 #define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
 #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
 
-static inline int get_pageblock_migratetype(struct page *page)
+#define get_pageblock_migratetype(page)					\
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+			PB_migrate_end, MIGRATETYPE_MASK)
+
+static inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn)
 {
 	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
-	return get_pageblock_flags_mask(page, PB_migrate_end, MIGRATETYPE_MASK);
+	return get_pfnblock_flags_mask(page, pfn, PB_migrate_end,
+					MIGRATETYPE_MASK);
 }
 
 struct free_area {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index c08730c..2baeee1 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -65,33 +65,26 @@ extern int pageblock_order;
 /* Forward declaration */
 struct page;
 
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page,
+				unsigned long pfn,
 				unsigned long end_bitidx,
 				unsigned long mask);
-void set_pageblock_flags_mask(struct page *page,
+
+void set_pfnblock_flags_mask(struct page *page,
 				unsigned long flags,
+				unsigned long pfn,
 				unsigned long end_bitidx,
 				unsigned long mask);
 
 /* Declarations for getting and setting flags. See mm/page_alloc.c */
-static inline unsigned long get_pageblock_flags_group(struct page *page,
-					int start_bitidx, int end_bitidx)
-{
-	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
-	unsigned long mask = (1 << nr_flag_bits) - 1;
-
-	return get_pageblock_flags_mask(page, end_bitidx, mask);
-}
-
-static inline void set_pageblock_flags_group(struct page *page,
-					unsigned long flags,
-					int start_bitidx, int end_bitidx)
-{
-	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
-	unsigned long mask = (1 << nr_flag_bits) - 1;
-
-	set_pageblock_flags_mask(page, flags, end_bitidx, mask);
-}
+#define get_pageblock_flags_group(page, start_bitidx, end_bitidx) \
+	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
+			end_bitidx,					\
+			(1 << (end_bitidx - start_bitidx + 1)) - 1)
+#define set_pageblock_flags_group(page, flags, start_bitidx, end_bitidx) \
+	set_pfnblock_flags_mask(page, flags, page_to_pfn(page),		\
+			end_bitidx,					\
+			(1 << (end_bitidx - start_bitidx + 1)) - 1)
 
 #ifdef CONFIG_COMPACTION
 #define get_pageblock_skip(page) \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b438eb7..3948f0a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -559,6 +559,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
  */
 
 static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
 		struct zone *zone, unsigned int order,
 		int migratetype)
 {
@@ -575,7 +576,7 @@ static inline void __free_one_page(struct page *page,
 
 	VM_BUG_ON(migratetype == -1);
 
-	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+	page_idx = pfn & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
@@ -710,7 +711,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			list_del(&page->lru);
 			mt = get_freepage_migratetype(page);
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
-			__free_one_page(page, zone, 0, mt);
+			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
 			if (likely(!is_migrate_isolate_page(page))) {
 				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
@@ -722,13 +723,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock(&zone->lock);
 }
 
-static void free_one_page(struct zone *zone, struct page *page, int order,
+static void free_one_page(struct zone *zone,
+				struct page *page, unsigned long pfn,
+				int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
 	zone->pages_scanned = 0;
 
-	__free_one_page(page, zone, order, migratetype);
+	__free_one_page(page, pfn, zone, order, migratetype);
 	if (unlikely(!is_migrate_isolate(migratetype)))
 		__mod_zone_freepage_state(zone, 1 << order, migratetype);
 	spin_unlock(&zone->lock);
@@ -765,15 +768,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	unsigned long pfn = page_to_pfn(page);
 
 	if (!free_pages_prepare(page, order))
 		return;
 
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
-	free_one_page(page_zone(page), page, order, migratetype);
+	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
 }
 
@@ -1376,12 +1380,13 @@ void free_hot_cold_page(struct page *page, int cold)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
+	unsigned long pfn = page_to_pfn(page);
 	int migratetype;
 
 	if (!free_pages_prepare(page, 0))
 		return;
 
-	migratetype = get_pageblock_migratetype(page);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
@@ -1395,7 +1400,7 @@ void free_hot_cold_page(struct page *page, int cold)
 	 */
 	if (migratetype >= MIGRATE_PCPTYPES) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, 0, migratetype);
+			free_one_page(zone, page, pfn, 0, migratetype);
 			goto out;
 		}
 		migratetype = MIGRATE_MOVABLE;
@@ -6032,17 +6037,16 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
  * @end_bitidx: The last bit of interest
  * returns pageblock_bits flags
  */
-unsigned long get_pageblock_flags_mask(struct page *page,
+unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
 					unsigned long end_bitidx,
 					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long bitidx, word_bitidx;
 	unsigned long word;
 
 	zone = page_zone(page);
-	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
@@ -6054,25 +6058,25 @@ unsigned long get_pageblock_flags_mask(struct page *page,
 }
 
 /**
- * set_pageblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
+ * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
  * @page: The page within the block of interest
  * @start_bitidx: The first bit of interest
  * @end_bitidx: The last bit of interest
  * @flags: The flags to set
  */
-void set_pageblock_flags_mask(struct page *page, unsigned long flags,
+void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
+					unsigned long pfn,
 					unsigned long end_bitidx,
 					unsigned long mask)
 {
 	struct zone *zone;
 	unsigned long *bitmap;
-	unsigned long pfn, bitidx, word_bitidx;
+	unsigned long bitidx, word_bitidx;
 	unsigned long old_word, word;
 
 	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
 
 	zone = page_zone(page);
-	pfn = page_to_pfn(page);
 	bitmap = get_pageblock_bitmap(zone, pfn);
 	bitidx = pfn_to_bitidx(zone, pfn);
 	word_bitidx = bitidx / BITS_PER_LONG;
-- 
1.8.4.5



* [PATCH 11/19] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

get_pageblock_migratetype() is called during free with IRQs disabled.
This is not required and keeps IRQs disabled for longer than necessary,
so look up the pageblock migratetype before disabling IRQs.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3948f0a..fcbf637 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -773,9 +773,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	if (!free_pages_prepare(page, order))
 		return;
 
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	local_irq_save(flags);
 	__count_vm_events(PGFREE, 1 << order);
-	migratetype = get_pfnblock_migratetype(page, pfn);
 	set_freepage_migratetype(page, migratetype);
 	free_one_page(page_zone(page), page, pfn, order, migratetype);
 	local_irq_restore(flags);
-- 
1.8.4.5



* [PATCH 12/19] mm: page_alloc: Use unsigned int for order in more places
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

X86 prefers the use of unsigned types for iterators and there is a
tendency to mix whether a signed or unsigned type is used for page
order. This converts a number of sites in mm/page_alloc.c to use
unsigned int for order where possible.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/mmzone.h |  8 ++++----
 mm/page_alloc.c        | 43 +++++++++++++++++++++++--------------------
 2 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd6f504..974a4ef 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -817,10 +817,10 @@ static inline bool pgdat_is_empty(pg_data_t *pgdat)
 extern struct mutex zonelists_mutex;
 void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
 void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
-bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		int classzone_idx, int alloc_flags);
-bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
-		int classzone_idx, int alloc_flags);
+bool zone_watermark_ok(struct zone *z, unsigned int order,
+		unsigned long mark, int classzone_idx, int alloc_flags);
+bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
+		unsigned long mark, int classzone_idx, int alloc_flags);
 enum memmap_context {
 	MEMMAP_EARLY,
 	MEMMAP_HOTPLUG,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fcbf637..a47f1c5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -408,7 +408,8 @@ static int destroy_compound_page(struct page *page, unsigned long order)
 	return bad;
 }
 
-static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
+static inline void prep_zero_page(struct page *page, unsigned int order,
+							gfp_t gfp_flags)
 {
 	int i;
 
@@ -452,7 +453,7 @@ static inline void set_page_guard_flag(struct page *page) { }
 static inline void clear_page_guard_flag(struct page *page) { }
 #endif
 
-static inline void set_page_order(struct page *page, int order)
+static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
 	__SetPageBuddy(page);
@@ -503,7 +504,7 @@ __find_buddy_index(unsigned long page_idx, unsigned int order)
  * For recording page's order, we use page_private(page).
  */
 static inline int page_is_buddy(struct page *page, struct page *buddy,
-								int order)
+							unsigned int order)
 {
 	if (!pfn_valid_within(page_to_pfn(buddy)))
 		return 0;
@@ -725,7 +726,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 static void free_one_page(struct zone *zone,
 				struct page *page, unsigned long pfn,
-				int order,
+				unsigned int order,
 				int migratetype)
 {
 	spin_lock(&zone->lock);
@@ -896,7 +897,7 @@ static inline int check_new_page(struct page *page)
 	return 0;
 }
 
-static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
+static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags)
 {
 	int i;
 
@@ -1104,16 +1105,17 @@ static int try_to_steal_freepages(struct zone *zone, struct page *page,
 
 /* Remove an element from the buddy allocator from the fallback list */
 static inline struct page *
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
+__rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 {
 	struct free_area *area;
-	int current_order;
+	unsigned int current_order;
 	struct page *page;
 	int migratetype, new_type, i;
 
 	/* Find the largest possible block of pages in the other list */
-	for (current_order = MAX_ORDER-1; current_order >= order;
-						--current_order) {
+	for (current_order = MAX_ORDER-1;
+				current_order >= order && current_order <= MAX_ORDER-1;
+				--current_order) {
 		for (i = 0;; i++) {
 			migratetype = fallbacks[start_migratetype][i];
 
@@ -1341,7 +1343,7 @@ void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
 	unsigned long flags;
-	int order, t;
+	unsigned int order, t;
 	struct list_head *curr;
 
 	if (zone_is_empty(zone))
@@ -1537,8 +1539,8 @@ int split_free_page(struct page *page)
  */
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
-			struct zone *zone, int order, gfp_t gfp_flags,
-			int migratetype)
+			struct zone *zone, unsigned int order,
+			gfp_t gfp_flags, int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
@@ -1687,8 +1689,9 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
  * Return true if free pages are above 'mark'. This takes into account the order
  * of the allocation.
  */
-static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags, long free_pages)
+static bool __zone_watermark_ok(struct zone *z, unsigned int order,
+			unsigned long mark, int classzone_idx, int alloc_flags,
+			long free_pages)
 {
 	/* free_pages my go negative - that's OK */
 	long min = mark;
@@ -1722,15 +1725,15 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 	return true;
 }
 
-bool zone_watermark_ok(struct zone *z, int order, unsigned long mark,
+bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		      int classzone_idx, int alloc_flags)
 {
 	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
 					zone_page_state(z, NR_FREE_PAGES));
 }
 
-bool zone_watermark_ok_safe(struct zone *z, int order, unsigned long mark,
-		      int classzone_idx, int alloc_flags)
+bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
+			unsigned long mark, int classzone_idx, int alloc_flags)
 {
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
 
@@ -4123,7 +4126,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
-	int order, t;
+	unsigned int order, t;
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
@@ -6448,7 +6451,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 {
 	struct page *page;
 	struct zone *zone;
-	int order, i;
+	unsigned int order, i;
 	unsigned long pfn;
 	unsigned long flags;
 	/* find the first valid pfn */
@@ -6500,7 +6503,7 @@ bool is_free_buddy_page(struct page *page)
 	struct zone *zone = page_zone(page);
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long flags;
-	int order;
+	unsigned int order;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (order = 0; order < MAX_ORDER; order++) {
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 13/19] mm: page_alloc: Convert hot/cold parameter and immediate callers to bool
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (11 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 12/19] mm: page_alloc: Use unsigned int for order in more places Mel Gorman
@ 2014-05-13  9:45 ` Mel Gorman
  2014-05-13  9:45 ` [PATCH 14/19] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Mel Gorman
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

cold is effectively a bool, so make it one. Make the likely case the "if"
part of the block instead of the else as, according to the optimisation
manual, this is preferred.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 arch/tile/mm/homecache.c |  2 +-
 fs/fuse/dev.c            |  2 +-
 include/linux/gfp.h      |  4 ++--
 include/linux/pagemap.h  |  2 +-
 include/linux/swap.h     |  2 +-
 mm/page_alloc.c          | 20 ++++++++++----------
 mm/swap.c                |  4 ++--
 mm/swap_state.c          |  2 +-
 mm/vmscan.c              |  6 +++---
 9 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/tile/mm/homecache.c b/arch/tile/mm/homecache.c
index 004ba56..33294fd 100644
--- a/arch/tile/mm/homecache.c
+++ b/arch/tile/mm/homecache.c
@@ -417,7 +417,7 @@ void __homecache_free_pages(struct page *page, unsigned int order)
 	if (put_page_testzero(page)) {
 		homecache_change_page_home(page, order, PAGE_HOME_HASH);
 		if (order == 0) {
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		} else {
 			init_page_count(page);
 			__free_pages(page, order);
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index aac71ce..098f97b 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1614,7 +1614,7 @@ out_finish:
 
 static void fuse_retrieve_end(struct fuse_conn *fc, struct fuse_req *req)
 {
-	release_pages(req->pages, req->num_pages, 0);
+	release_pages(req->pages, req->num_pages, false);
 }
 
 static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 39b81dc..3824ac6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -369,8 +369,8 @@ void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
 
 extern void __free_pages(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
-extern void free_hot_cold_page(struct page *page, int cold);
-extern void free_hot_cold_page_list(struct list_head *list, int cold);
+extern void free_hot_cold_page(struct page *page, bool cold);
+extern void free_hot_cold_page_list(struct list_head *list, bool cold);
 
 extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
 extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 45598f1..9175f52 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -110,7 +110,7 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 
 #define page_cache_get(page)		get_page(page)
 #define page_cache_release(page)	put_page(page)
-void release_pages(struct page **pages, int nr, int cold);
+void release_pages(struct page **pages, int nr, bool cold);
 
 /*
  * speculatively take a reference to a page.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3507115..da8a250 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -496,7 +496,7 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 #define free_page_and_swap_cache(page) \
 	page_cache_release(page)
 #define free_pages_and_swap_cache(pages, nr) \
-	release_pages((pages), (nr), 0);
+	release_pages((pages), (nr), false);
 
 static inline void show_swap_cache_info(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a47f1c5..02f3ffc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1189,7 +1189,7 @@ retry_reserve:
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype, int cold)
+			int migratetype, bool cold)
 {
 	int mt = migratetype, i;
 
@@ -1208,7 +1208,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * merge IO requests if the physical pages are ordered
 		 * properly.
 		 */
-		if (likely(cold == 0))
+		if (likely(!cold))
 			list_add(&page->lru, list);
 		else
 			list_add_tail(&page->lru, list);
@@ -1375,9 +1375,9 @@ void mark_free_pages(struct zone *zone)
 
 /*
  * Free a 0-order page
- * cold == 1 ? free a cold page : free a hot page
+ * cold == true ? free a cold page : free a hot page
  */
-void free_hot_cold_page(struct page *page, int cold)
+void free_hot_cold_page(struct page *page, bool cold)
 {
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
@@ -1409,10 +1409,10 @@ void free_hot_cold_page(struct page *page, int cold)
 	}
 
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
-	if (cold)
-		list_add_tail(&page->lru, &pcp->lists[migratetype]);
-	else
+	if (!cold)
 		list_add(&page->lru, &pcp->lists[migratetype]);
+	else
+		list_add_tail(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
 	if (pcp->count >= pcp->high) {
 		unsigned long batch = ACCESS_ONCE(pcp->batch);
@@ -1427,7 +1427,7 @@ out:
 /*
  * Free a list of 0-order pages
  */
-void free_hot_cold_page_list(struct list_head *list, int cold)
+void free_hot_cold_page_list(struct list_head *list, bool cold)
 {
 	struct page *page, *next;
 
@@ -1544,7 +1544,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 {
 	unsigned long flags;
 	struct page *page;
-	int cold = !!(gfp_flags & __GFP_COLD);
+	bool cold = ((gfp_flags & __GFP_COLD) != 0);
 
 again:
 	if (likely(order == 0)) {
@@ -2849,7 +2849,7 @@ void __free_pages(struct page *page, unsigned int order)
 {
 	if (put_page_testzero(page)) {
 		if (order == 0)
-			free_hot_cold_page(page, 0);
+			free_hot_cold_page(page, false);
 		else
 			__free_pages_ok(page, order);
 	}
diff --git a/mm/swap.c b/mm/swap.c
index 9ce43ba..f2228b7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,7 +67,7 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
-	free_hot_cold_page(page, 0);
+	free_hot_cold_page(page, false);
 }
 
 static void __put_compound_page(struct page *page)
@@ -813,7 +813,7 @@ void lru_add_drain_all(void)
  * grabbed the page via the LRU.  If it did, give up: shrink_inactive_list()
  * will free it.
  */
-void release_pages(struct page **pages, int nr, int cold)
+void release_pages(struct page **pages, int nr, bool cold)
 {
 	int i;
 	LIST_HEAD(pages_to_free);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index e76ace3..2972eee 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -270,7 +270,7 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 
 		for (i = 0; i < todo; i++)
 			free_swap_cache(pagep[i]);
-		release_pages(pagep, todo, 0);
+		release_pages(pagep, todo, false);
 		pagep += todo;
 		nr -= todo;
 	}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f56c8d..8db1318 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1121,7 +1121,7 @@ keep:
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
-	free_hot_cold_page_list(&free_pages, 1);
+	free_hot_cold_page_list(&free_pages, true);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
@@ -1519,7 +1519,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&page_list, 1);
+	free_hot_cold_page_list(&page_list, true);
 
 	/*
 	 * If reclaim is isolating dirty pages under writeback, it implies
@@ -1740,7 +1740,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&zone->lru_lock);
 
-	free_hot_cold_page_list(&l_hold, 1);
+	free_hot_cold_page_list(&l_hold, true);
 }
 
 #ifdef CONFIG_SWAP
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 14/19] mm: shmem: Avoid atomic operation during shmem_getpage_gfp
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (12 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 13/19] mm: page_alloc: Convert hot/cold parameter and immediate callers to bool Mel Gorman
@ 2014-05-13  9:45 ` Mel Gorman
  2014-05-13  9:45 ` [PATCH 15/19] mm: Do not use atomic operations when releasing pages Mel Gorman
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
before the page is even added to the LRU or visible. This is unnecessary as
what could it possibly race against? Use an unlocked variant.
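
For illustration, a minimal sketch of the pattern being relied on here
(hypothetical helper, not the shmem code itself): the non-atomic __Set
variants are only safe while no other CPU can hold a reference, i.e.
before the page is published to the page cache and LRU.

	/* Sketch only: initialise flags before the page becomes visible */
	static struct page *alloc_unpublished_page(gfp_t gfp)
	{
		struct page *page = alloc_page(gfp);

		if (!page)
			return NULL;

		/* No other references exist yet so plain stores suffice */
		__SetPageSwapBacked(page);
		__set_page_locked(page);

		/*
		 * Once the page has been added to the page cache or LRU,
		 * flag updates must go back to the atomic variants.
		 */
		return page;
	}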

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/page-flags.h | 1 +
 mm/shmem.c                 | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index d1fe1a7..4d4b39a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -208,6 +208,7 @@ PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
 PAGEFLAG(SavePinned, savepinned);			/* Xen */
 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
+	__SETPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f70e02..f47fb38 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1132,7 +1132,7 @@ repeat:
 			goto decused;
 		}
 
-		SetPageSwapBacked(page);
+		__SetPageSwapBacked(page);
 		__set_page_locked(page);
 		error = mem_cgroup_charge_file(page, current->mm,
 						gfp & GFP_RECLAIM_MASK);
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 15/19] mm: Do not use atomic operations when releasing pages
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (13 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 14/19] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Mel Gorman
@ 2014-05-13  9:45 ` Mel Gorman
  2014-05-13  9:45 ` [PATCH 16/19] mm: Do not use unnecessary atomic operations when adding pages to the LRU Mel Gorman
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

There should be no references to the page any more and a parallel mark
should not be reordered against us. Use the non-locked variant to clear
the page active flag.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/swap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index f2228b7..7a5bdd7 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -854,7 +854,7 @@ void release_pages(struct page **pages, int nr, bool cold)
 		}
 
 		/* Clear Active bit in case of parallel mark_page_accessed */
-		ClearPageActive(page);
+		__ClearPageActive(page);
 
 		list_add(&page->lru, &pages_to_free);
 	}
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 16/19] mm: Do not use unnecessary atomic operations when adding pages to the LRU
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (14 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 15/19] mm: Do not use atomic operations when releasing pages Mel Gorman
@ 2014-05-13  9:45 ` Mel Gorman
  2014-05-13  9:45 ` [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers Mel Gorman
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

When adding pages to the LRU we clear the active bit unconditionally. As the
page could be reachable from other paths we cannot use unlocked operations
without risk of corruption such as a parallel mark_page_accessed. This
patch tests whether it is necessary to clear the active flag before using
an atomic operation. This potentially opens a tiny race when PageActive is
checked as mark_page_accessed could be called after PageActive was checked.
The race already exists but this patch changes it slightly. The consequence
is that the page may be promoted to the active list where previously it
might have been left on the inactive list. It's too tiny a race and too
marginal a consequence to always use atomic operations for.
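
As a sketch of the window in question (this is essentially the hunk below
with comments added; the helper and flag names are from the existing code):

	static inline void lru_cache_add_file(struct page *page)
	{
		if (PageActive(page))		/* plain test, no atomic op */
			ClearPageActive(page);
		/*
		 * A parallel mark_page_accessed() landing here can set
		 * PG_active again, so the page may be added to the LRU
		 * active. The same race existed before this patch, only
		 * the timing shifts slightly.
		 */
		__lru_cache_add(page);
	}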

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/swap.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index da8a250..395dcab 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -329,13 +329,15 @@ extern void add_page_to_unevictable_list(struct page *page);
  */
 static inline void lru_cache_add_anon(struct page *page)
 {
-	ClearPageActive(page);
+	if (PageActive(page))
+		ClearPageActive(page);
 	__lru_cache_add(page);
 }
 
 static inline void lru_cache_add_file(struct page *page)
 {
-	ClearPageActive(page);
+	if (PageActive(page))
+		ClearPageActive(page);
 	__lru_cache_add(page);
 }
 
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (15 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 16/19] mm: Do not use unnecessary atomic operations when adding pages to the LRU Mel Gorman
@ 2014-05-13  9:45 ` Mel Gorman
  2014-05-13 11:09   ` Peter Zijlstra
                     ` (2 more replies)
  2014-05-13  9:45 ` [PATCH 18/19] mm: Non-atomically mark page accessed during page cache allocation where possible Mel Gorman
                   ` (2 subsequent siblings)
  19 siblings, 3 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

Discarding buffers uses a bunch of atomic operations because ...... I can't
think of a reason. Use a cmpxchg loop to clear all the necessary flags. In
most (all?) cases this will be a single atomic operation.
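
As a rough sketch of why this is normally a single atomic operation (this
restates the loop in the hunk below with comments on when it retries):

	unsigned long old, new, cur = bh->b_state;

	for (;;) {
		new = cur & ~BUFFER_FLAGS_DISCARD;
		old = cmpxchg(&bh->b_state, cur, new);
		if (old == cur)
			break;		/* nobody changed b_state, done */
		cur = old;		/* raced with another update, retry */
	}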

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/buffer.c                 | 14 +++++++++-----
 include/linux/buffer_head.h |  5 +++++
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 9ddb9fc..e80012d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1485,14 +1485,18 @@ EXPORT_SYMBOL(set_bh_page);
  */
 static void discard_buffer(struct buffer_head * bh)
 {
+	unsigned long b_state, b_state_old;
+
 	lock_buffer(bh);
 	clear_buffer_dirty(bh);
 	bh->b_bdev = NULL;
-	clear_buffer_mapped(bh);
-	clear_buffer_req(bh);
-	clear_buffer_new(bh);
-	clear_buffer_delay(bh);
-	clear_buffer_unwritten(bh);
+	b_state = bh->b_state;
+	for (;;) {
+		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
+		if (b_state_old == b_state)
+			break;
+		b_state = b_state_old;
+	}
 	unlock_buffer(bh);
 }
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index c40302f..95f565a 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -77,6 +77,11 @@ struct buffer_head {
 	atomic_t b_count;		/* users using this buffer_head */
 };
 
+/* Bits that are cleared during an invalidate */
+#define BUFFER_FLAGS_DISCARD \
+	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
+	 1 << BH_Delay | 1 << BH_Unwritten)
+
 /*
  * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
  * and buffer_foo() functions.
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 18/19] mm: Non-atomically mark page accessed during page cache allocation where possible
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (16 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers Mel Gorman
@ 2014-05-13  9:45 ` Mel Gorman
  2014-05-13 14:29   ` Theodore Ts'o
  2014-05-20 15:49   ` [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix Mel Gorman
  2014-05-13  9:45 ` [PATCH 19/19] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath Mel Gorman
  2014-05-19  8:57 ` [PATCH] mm: Avoid unnecessary atomic operations during end_page_writeback Mel Gorman
  19 siblings, 2 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

aops->write_begin may allocate a new page and make it visible only to
have mark_page_accessed called almost immediately after. Once the page is
visible the atomic operations are necessary, which is noticeable overhead
when writing to an in-memory filesystem like tmpfs but should also be
noticeable with fast storage. The objective of the patch is to initialise
the accessed information with non-atomic operations before the page is
visible.

The bulk of filesystems directly or indirectly use
grab_cache_page_write_begin or find_or_create_page for the initial
allocation of a page cache page. This patch adds an init_page_accessed()
helper which behaves like the first call to mark_page_accessed() but may
be called before the page is visible and can be done non-atomically.

The primary APIs of concern in this case are the following and are used
by most filesystems.

	find_get_page
	find_lock_page
	find_or_create_page
	grab_cache_page_nowait
	grab_cache_page_write_begin

All of them are very similar in detail, so the patch creates a core helper
pagecache_get_page() which takes a flags parameter that affects its
behaviour, such as whether the page should be marked accessed or not. The
old API is preserved but each call is basically a thin wrapper around this
core function.

Each of the filesystems is then updated to avoid calling mark_page_accessed
when it is known that the VM interfaces have already done the job. There
is a slight snag in that the timing of the mark_page_accessed() has now
changed, so in rare cases it's possible a page gets to the end of the LRU
as PageReferenced where previously it might have been repromoted. This is
expected to be rare but it's worth the filesystem people thinking about it
in case they see a problem with the timing change. It is also the case that
some filesystems may now be marking pages accessed that they previously did
not, but it makes sense for filesystems to have consistent behaviour in
this regard. A sketch of the before/after usage follows.
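
For a filesystem, the conversion looks roughly like the sketch below (based
on the buffer.c and ext4 hunks further down; error handling omitted and
gfp_mask standing in for the caller's allocation mask):

	/* Before: lookup followed by a separate atomic mark_page_accessed */
	page = find_get_page(mapping, index);
	if (page)
		mark_page_accessed(page);

	/* After: let the lookup helper handle the accessed information */
	page = find_get_page_flags(mapping, index, FGP_ACCESSED);

	/* Creation paths go through the core helper with FGP_CREAT */
	page = pagecache_get_page(mapping, index,
				  FGP_LOCK | FGP_ACCESSED | FGP_CREAT,
				  gfp_mask, gfp_mask & GFP_RECLAIM_MASK);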

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the async
case it is possible that the workload completes without even hitting the
disk and will have variable results, but it highlights the impact of
mark_page_accessed for async IO. The sync results are expected to be more
stable. The exception is tmpfs where the normal case is for the "IO" to
not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO; only wall
times are shown for async as the granularity reported by dd and the
variability make the throughput figures unsuitable for comparison. As async
results were variable due to writeback timings, I'm only reporting the
maximum figures. The sync results were stable enough to make the mean and
stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running.

async dd
                                    3.15.0-rc3            3.15.0-rc3
                                       vanilla           accessed-v2
ext3    Max      elapsed     13.9900 (  0.00%)     11.5900 ( 17.16%)
tmpfs	Max      elapsed      0.5100 (  0.00%)      0.4900 (  3.92%)
btrfs   Max      elapsed     12.8100 (  0.00%)     12.7800 (  0.23%)
ext4	Max      elapsed     18.6000 (  0.00%)     13.3400 ( 28.28%)
xfs	Max      elapsed     12.5600 (  0.00%)      2.0900 ( 83.36%)

The XFS figure is a bit misleading as it managed to avoid a worst case by
sheer luck, but the average figures looked reasonable.

        samples percentage
ext3       86107    0.9783  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext3       23833    0.2710  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext3        5036    0.0573  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
ext4       64566    0.8961  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
ext4        5322    0.0713  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
ext4        2869    0.0384  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs        62126    1.7675  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
xfs         1904    0.0554  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
xfs          103    0.0030  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
btrfs      10655    0.1338  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
btrfs       2020    0.0273  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
btrfs        587    0.0079  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed
tmpfs      59562    3.2628  vmlinux-3.15.0-rc4-vanilla        mark_page_accessed
tmpfs       1210    0.0696  vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed
tmpfs         94    0.0054  vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/extent_io.c       |  11 +--
 fs/btrfs/file.c            |   5 +-
 fs/buffer.c                |   7 +-
 fs/ext4/mballoc.c          |  14 ++--
 fs/f2fs/checkpoint.c       |   3 -
 fs/f2fs/node.c             |   2 -
 fs/fuse/file.c             |   2 -
 fs/gfs2/aops.c             |   1 -
 fs/gfs2/meta_io.c          |   4 +-
 fs/ntfs/attrib.c           |   1 -
 fs/ntfs/file.c             |   1 -
 include/linux/page-flags.h |   1 +
 include/linux/pagemap.h    | 107 ++++++++++++++++++++++--
 include/linux/swap.h       |   1 +
 mm/filemap.c               | 202 +++++++++++++++++----------------------------
 mm/shmem.c                 |   6 +-
 mm/swap.c                  |  11 +++
 17 files changed, 217 insertions(+), 162 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3955e47..158833c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4510,7 +4510,8 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
 	spin_unlock(&eb->refs_lock);
 }
 
-static void mark_extent_buffer_accessed(struct extent_buffer *eb)
+static void mark_extent_buffer_accessed(struct extent_buffer *eb,
+		struct page *accessed)
 {
 	unsigned long num_pages, i;
 
@@ -4519,7 +4520,8 @@ static void mark_extent_buffer_accessed(struct extent_buffer *eb)
 	num_pages = num_extent_pages(eb->start, eb->len);
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
-		mark_page_accessed(p);
+		if (p != accessed)
+			mark_page_accessed(p);
 	}
 }
 
@@ -4533,7 +4535,7 @@ struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
 			       start >> PAGE_CACHE_SHIFT);
 	if (eb && atomic_inc_not_zero(&eb->refs)) {
 		rcu_read_unlock();
-		mark_extent_buffer_accessed(eb);
+		mark_extent_buffer_accessed(eb, NULL);
 		return eb;
 	}
 	rcu_read_unlock();
@@ -4581,7 +4583,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 				spin_unlock(&mapping->private_lock);
 				unlock_page(p);
 				page_cache_release(p);
-				mark_extent_buffer_accessed(exists);
+				mark_extent_buffer_accessed(exists, p);
 				goto free_eb;
 			}
 
@@ -4596,7 +4598,6 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		attach_extent_buffer_page(eb, p);
 		spin_unlock(&mapping->private_lock);
 		WARN_ON(PageDirty(p));
-		mark_page_accessed(p);
 		eb->pages[i] = p;
 		if (!PageUptodate(p))
 			uptodate = 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ae6af07..74272a3 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -470,11 +470,12 @@ static void btrfs_drop_pages(struct page **pages, size_t num_pages)
 	for (i = 0; i < num_pages; i++) {
 		/* page checked is some magic around finding pages that
 		 * have been modified without going through btrfs_set_page_dirty
-		 * clear it here
+		 * clear it here. There should be no need to mark the pages
+		 * accessed as prepare_pages should have marked them accessed
+		 * in prepare_pages via find_or_create_page()
 		 */
 		ClearPageChecked(pages[i]);
 		unlock_page(pages[i]);
-		mark_page_accessed(pages[i]);
 		page_cache_release(pages[i]);
 	}
 }
diff --git a/fs/buffer.c b/fs/buffer.c
index e80012d..c541de0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -227,7 +227,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 	int all_mapped = 1;
 
 	index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits);
-	page = find_get_page(bd_mapping, index);
+	page = find_get_page_flags(bd_mapping, index, FGP_ACCESSED);
 	if (!page)
 		goto out;
 
@@ -1366,12 +1366,13 @@ __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
 	struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
 
 	if (bh == NULL) {
+		/* __find_get_block_slow will mark the page accessed */
 		bh = __find_get_block_slow(bdev, block);
 		if (bh)
 			bh_lru_install(bh);
-	}
-	if (bh)
+	} else
 		touch_buffer(bh);
+
 	return bh;
 }
 EXPORT_SYMBOL(__find_get_block);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c8238a2..afe8a13 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1044,6 +1044,8 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 	 * allocating. If we are looking at the buddy cache we would
 	 * have taken a reference using ext4_mb_load_buddy and that
 	 * would have pinned buddy page to page cache.
+	 * The call to ext4_mb_get_buddy_page_lock will mark the
+	 * page accessed.
 	 */
 	ret = ext4_mb_get_buddy_page_lock(sb, group, &e4b);
 	if (ret || !EXT4_MB_GRP_NEED_INIT(this_grp)) {
@@ -1062,7 +1064,6 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 		ret = -EIO;
 		goto err;
 	}
-	mark_page_accessed(page);
 
 	if (e4b.bd_buddy_page == NULL) {
 		/*
@@ -1082,7 +1083,6 @@ int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
 		ret = -EIO;
 		goto err;
 	}
-	mark_page_accessed(page);
 err:
 	ext4_mb_put_buddy_page_lock(&e4b);
 	return ret;
@@ -1141,7 +1141,7 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 
 	/* we could use find_or_create_page(), but it locks page
 	 * what we'd like to avoid in fast path ... */
-	page = find_get_page(inode->i_mapping, pnum);
+	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			/*
@@ -1176,15 +1176,16 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 		ret = -EIO;
 		goto err;
 	}
+
+	/* Pages marked accessed already */
 	e4b->bd_bitmap_page = page;
 	e4b->bd_bitmap = page_address(page) + (poff * sb->s_blocksize);
-	mark_page_accessed(page);
 
 	block++;
 	pnum = block / blocks_per_page;
 	poff = block % blocks_per_page;
 
-	page = find_get_page(inode->i_mapping, pnum);
+	page = find_get_page_flags(inode->i_mapping, pnum, FGP_ACCESSED);
 	if (page == NULL || !PageUptodate(page)) {
 		if (page)
 			page_cache_release(page);
@@ -1209,9 +1210,10 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 		ret = -EIO;
 		goto err;
 	}
+
+	/* Pages marked accessed already */
 	e4b->bd_buddy_page = page;
 	e4b->bd_buddy = page_address(page) + (poff * sb->s_blocksize);
-	mark_page_accessed(page);
 
 	BUG_ON(e4b->bd_bitmap_page == NULL);
 	BUG_ON(e4b->bd_buddy_page == NULL);
diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index 4aa521a..c405b8f 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -69,7 +69,6 @@ repeat:
 		goto repeat;
 	}
 out:
-	mark_page_accessed(page);
 	return page;
 }
 
@@ -137,13 +136,11 @@ int ra_meta_pages(struct f2fs_sb_info *sbi, int start, int nrpages, int type)
 		if (!page)
 			continue;
 		if (PageUptodate(page)) {
-			mark_page_accessed(page);
 			f2fs_put_page(page, 1);
 			continue;
 		}
 
 		f2fs_submit_page_mbio(sbi, page, blk_addr, &fio);
-		mark_page_accessed(page);
 		f2fs_put_page(page, 0);
 	}
 out:
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index a161e95..57caa6e 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -967,7 +967,6 @@ repeat:
 		goto repeat;
 	}
 got_it:
-	mark_page_accessed(page);
 	return page;
 }
 
@@ -1022,7 +1021,6 @@ page_hit:
 		f2fs_put_page(page, 1);
 		return ERR_PTR(-EIO);
 	}
-	mark_page_accessed(page);
 	return page;
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 13f8bde..85a3359 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1089,8 +1089,6 @@ static ssize_t fuse_fill_write_pages(struct fuse_req *req,
 		tmp = iov_iter_copy_from_user_atomic(page, ii, offset, bytes);
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
-
 		if (!tmp) {
 			unlock_page(page);
 			page_cache_release(page);
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index ce62dca..3c1ab7b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -577,7 +577,6 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, loff_t *pos,
 		p = kmap_atomic(page);
 		memcpy(buf + copied, p + offset, amt);
 		kunmap_atomic(p);
-		mark_page_accessed(page);
 		page_cache_release(page);
 		copied += amt;
 		index++;
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 2cf09b6..b984a6e 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -136,7 +136,8 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
 			yield();
 		}
 	} else {
-		page = find_lock_page(mapping, index);
+		page = find_get_page_flags(mapping, index,
+						FGP_LOCK|FGP_ACCESSED);
 		if (!page)
 			return NULL;
 	}
@@ -153,7 +154,6 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock *gl, u64 blkno, int create)
 		map_bh(bh, sdp->sd_vfs, blkno);
 
 	unlock_page(page);
-	mark_page_accessed(page);
 	page_cache_release(page);
 
 	return bh;
diff --git a/fs/ntfs/attrib.c b/fs/ntfs/attrib.c
index a27e3fe..250ed5b 100644
--- a/fs/ntfs/attrib.c
+++ b/fs/ntfs/attrib.c
@@ -1748,7 +1748,6 @@ int ntfs_attr_make_non_resident(ntfs_inode *ni, const u32 data_size)
 	if (page) {
 		set_page_dirty(page);
 		unlock_page(page);
-		mark_page_accessed(page);
 		page_cache_release(page);
 	}
 	ntfs_debug("Done.");
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index db9bd8a..86ddab9 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2060,7 +2060,6 @@ static ssize_t ntfs_file_buffered_write(struct kiocb *iocb,
 		}
 		do {
 			unlock_page(pages[--do_pages]);
-			mark_page_accessed(pages[do_pages]);
 			page_cache_release(pages[do_pages]);
 		} while (do_pages);
 		if (unlikely(status))
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4d4b39a..2093eb7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -198,6 +198,7 @@ struct page;	/* forward declaration */
 TESTPAGEFLAG(Locked, locked)
 PAGEFLAG(Error, error) TESTCLEARFLAG(Error, error)
 PAGEFLAG(Referenced, referenced) TESTCLEARFLAG(Referenced, referenced)
+	__SETPAGEFLAG(Referenced, referenced)
 PAGEFLAG(Dirty, dirty) TESTSCFLAG(Dirty, dirty) __CLEARPAGEFLAG(Dirty, dirty)
 PAGEFLAG(LRU, lru) __CLEARPAGEFLAG(LRU, lru)
 PAGEFLAG(Active, active) __CLEARPAGEFLAG(Active, active)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 9175f52..e5ffaa0 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -259,12 +259,109 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
 pgoff_t page_cache_prev_hole(struct address_space *mapping,
 			     pgoff_t index, unsigned long max_scan);
 
+#define FGP_ACCESSED		0x00000001
+#define FGP_LOCK		0x00000002
+#define FGP_CREAT		0x00000004
+#define FGP_WRITE		0x00000008
+#define FGP_NOFS		0x00000010
+#define FGP_NOWAIT		0x00000020
+
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+		int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask);
+
+/**
+ * find_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * Otherwise, %NULL is returned.
+ */
+static inline struct page *find_get_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	return pagecache_get_page(mapping, offset, 0, 0, 0);
+}
+
+static inline struct page *find_get_page_flags(struct address_space *mapping,
+					pgoff_t offset, int fgp_flags)
+{
+	return pagecache_get_page(mapping, offset, fgp_flags, 0, 0);
+}
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * pagecache_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * Otherwise, %NULL is returned.
+ *
+ * find_lock_page() may sleep.
+ */
+static inline struct page *find_lock_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	return pagecache_get_page(mapping, offset, FGP_LOCK, 0, 0);
+}
+
+/**
+ * find_or_create_page - locate or add a pagecache page
+ * @mapping: the page's address_space
+ * @index: the page's index into the mapping
+ * @gfp_mask: page allocation mode
+ *
+ * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * If the page is not present, a new page is allocated using @gfp_mask
+ * and added to the page cache and the VM's LRU list.  The page is
+ * returned locked and with an increased refcount.
+ *
+ * On memory exhaustion, %NULL is returned.
+ *
+ * find_or_create_page() may sleep, even if @gfp_flags specifies an
+ * atomic allocation!
+ */
+static inline struct page *find_or_create_page(struct address_space *mapping,
+					pgoff_t offset, gfp_t gfp_mask)
+{
+	return pagecache_get_page(mapping, offset,
+					FGP_LOCK|FGP_ACCESSED|FGP_CREAT,
+					gfp_mask, gfp_mask & GFP_RECLAIM_MASK);
+}
+
+/**
+ * grab_cache_page_nowait - returns locked page at given index in given cache
+ * @mapping: target address_space
+ * @index: the page index
+ *
+ * Same as grab_cache_page(), but do not wait if the page is unavailable.
+ * This is intended for speculative data generators, where the data can
+ * be regenerated if the page couldn't be grabbed.  This routine should
+ * be safe to call while holding the lock for another page.
+ *
+ * Clear __GFP_FS when allocating the page to avoid recursion into the fs
+ * and deadlock against the caller's locked page.
+ */
+static inline struct page *grab_cache_page_nowait(struct address_space *mapping,
+				pgoff_t index)
+{
+	return pagecache_get_page(mapping, index,
+			FGP_LOCK|FGP_CREAT|FGP_NOFS|FGP_NOWAIT,
+			mapping_gfp_mask(mapping),
+			GFP_NOFS);
+}
+
 struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
 struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
-struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
-				 gfp_t gfp_mask);
 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
 			  unsigned int nr_entries, struct page **entries,
 			  pgoff_t *indices);
@@ -287,8 +384,6 @@ static inline struct page *grab_cache_page(struct address_space *mapping,
 	return find_or_create_page(mapping, index, mapping_gfp_mask(mapping));
 }
 
-extern struct page * grab_cache_page_nowait(struct address_space *mapping,
-				pgoff_t index);
 extern struct page * read_cache_page(struct address_space *mapping,
 				pgoff_t index, filler_t *filler, void *data);
 extern struct page * read_cache_page_gfp(struct address_space *mapping,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 395dcab..b570ad5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -314,6 +314,7 @@ extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
 extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
+extern void init_page_accessed(struct page *page);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
 extern void lru_add_drain_all(void);
diff --git a/mm/filemap.c b/mm/filemap.c
index 5020b28..c60ed0f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -955,26 +955,6 @@ out:
 EXPORT_SYMBOL(find_get_entry);
 
 /**
- * find_get_page - find and get a page reference
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned with an increased refcount.
- *
- * Otherwise, %NULL is returned.
- */
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_get_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_get_page);
-
-/**
  * find_lock_entry - locate, pin and lock a page cache entry
  * @mapping: the address_space to search
  * @offset: the page cache index
@@ -1011,66 +991,84 @@ repeat:
 EXPORT_SYMBOL(find_lock_entry);
 
 /**
- * find_lock_page - locate, pin and lock a pagecache page
+ * pagecache_get_page - find and get a page reference
  * @mapping: the address_space to search
  * @offset: the page index
+ * @fgp_flags: FGP flags
+ * @gfp_mask: gfp mask to use if a page is to be allocated
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
- *
- * Otherwise, %NULL is returned.
- *
- * find_lock_page() may sleep.
- */
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
-{
-	struct page *page = find_lock_entry(mapping, offset);
-
-	if (radix_tree_exceptional_entry(page))
-		page = NULL;
-	return page;
-}
-EXPORT_SYMBOL(find_lock_page);
-
-/**
- * find_or_create_page - locate or add a pagecache page
- * @mapping: the page's address_space
- * @index: the page's index into the mapping
- * @gfp_mask: page allocation mode
+ * Looks up the page cache slot at @mapping & @offset.
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * FGP flags modify how the page is returned
  *
- * If the page is not present, a new page is allocated using @gfp_mask
- * and added to the page cache and the VM's LRU list.  The page is
- * returned locked and with an increased refcount.
+ * FGP_ACCESSED: the page will be marked accessed
+ * FGP_LOCK: Page is return locked
+ * FGP_CREAT: If page is not present then a new page is allocated using
+ *		@gfp_mask and added to the page cache and the VM's LRU
+ *		list. The page is returned locked and with an increased
+ *		refcount. Otherwise, %NULL is returned.
  *
- * On memory exhaustion, %NULL is returned.
+ * If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
+ * if the GFP flags specified for FGP_CREAT are atomic.
  *
- * find_or_create_page() may sleep, even if @gfp_flags specifies an
- * atomic allocation!
+ * If there is a page cache page, it is returned with an increased refcount.
  */
-struct page *find_or_create_page(struct address_space *mapping,
-		pgoff_t index, gfp_t gfp_mask)
+struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
+	int fgp_flags, gfp_t cache_gfp_mask, gfp_t radix_gfp_mask)
 {
 	struct page *page;
-	int err;
+
 repeat:
-	page = find_lock_page(mapping, index);
-	if (!page) {
-		page = __page_cache_alloc(gfp_mask);
+	page = find_get_entry(mapping, offset);
+	if (radix_tree_exceptional_entry(page))
+		page = NULL;
+	if (!page)
+		goto no_page;
+
+	if (fgp_flags & FGP_LOCK) {
+		if (fgp_flags & FGP_NOWAIT) {
+			if (!trylock_page(page)) {
+				page_cache_release(page);
+				return NULL;
+			}
+		} else {
+			lock_page(page);
+		}
+
+		/* Has the page been truncated? */
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+		VM_BUG_ON_PAGE(page->index != offset, page);
+	}
+
+	if (page && (fgp_flags & FGP_ACCESSED))
+		mark_page_accessed(page);
+
+no_page:
+	if (!page && (fgp_flags & FGP_CREAT)) {
+		int err;
+		if ((fgp_flags & FGP_WRITE) && mapping_cap_account_dirty(mapping))
+			cache_gfp_mask |= __GFP_WRITE;
+		if (fgp_flags & FGP_NOFS) {
+			cache_gfp_mask &= ~__GFP_FS;
+			radix_gfp_mask &= ~__GFP_FS;
+		}
+
+		page = __page_cache_alloc(cache_gfp_mask);
 		if (!page)
 			return NULL;
-		/*
-		 * We want a regular kernel memory (not highmem or DMA etc)
-		 * allocation for the radix tree nodes, but we need to honour
-		 * the context-specific requirements the caller has asked for.
-		 * GFP_RECLAIM_MASK collects those requirements.
-		 */
-		err = add_to_page_cache_lru(page, mapping, index,
-			(gfp_mask & GFP_RECLAIM_MASK));
+
+		if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
+			fgp_flags |= FGP_LOCK;
+
+		/* Init accessed so avoid atomic mark_page_accessed later */
+		if (fgp_flags & FGP_ACCESSED)
+			init_page_accessed(page);
+
+		err = add_to_page_cache_lru(page, mapping, offset, radix_gfp_mask);
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -1078,9 +1076,10 @@ repeat:
 				goto repeat;
 		}
 	}
+
 	return page;
 }
-EXPORT_SYMBOL(find_or_create_page);
+EXPORT_SYMBOL(pagecache_get_page);
 
 /**
  * find_get_entries - gang pagecache lookup
@@ -1370,39 +1369,6 @@ repeat:
 }
 EXPORT_SYMBOL(find_get_pages_tag);
 
-/**
- * grab_cache_page_nowait - returns locked page at given index in given cache
- * @mapping: target address_space
- * @index: the page index
- *
- * Same as grab_cache_page(), but do not wait if the page is unavailable.
- * This is intended for speculative data generators, where the data can
- * be regenerated if the page couldn't be grabbed.  This routine should
- * be safe to call while holding the lock for another page.
- *
- * Clear __GFP_FS when allocating the page to avoid recursion into the fs
- * and deadlock against the caller's locked page.
- */
-struct page *
-grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
-{
-	struct page *page = find_get_page(mapping, index);
-
-	if (page) {
-		if (trylock_page(page))
-			return page;
-		page_cache_release(page);
-		return NULL;
-	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
-		page_cache_release(page);
-		page = NULL;
-	}
-	return page;
-}
-EXPORT_SYMBOL(grab_cache_page_nowait);
-
 /*
  * CD/DVDs are error prone. When a medium error occurs, the driver may fail
  * a _large_ part of the i/o request. Imagine the worst scenario:
@@ -2372,7 +2338,6 @@ int pagecache_write_end(struct file *file, struct address_space *mapping,
 {
 	const struct address_space_operations *aops = mapping->a_ops;
 
-	mark_page_accessed(page);
 	return aops->write_end(file, mapping, pos, len, copied, page, fsdata);
 }
 EXPORT_SYMBOL(pagecache_write_end);
@@ -2454,34 +2419,18 @@ EXPORT_SYMBOL(generic_file_direct_write);
 struct page *grab_cache_page_write_begin(struct address_space *mapping,
 					pgoff_t index, unsigned flags)
 {
-	int status;
-	gfp_t gfp_mask;
 	struct page *page;
-	gfp_t gfp_notmask = 0;
+	int fgp_flags = FGP_LOCK|FGP_ACCESSED|FGP_WRITE|FGP_CREAT;
 
-	gfp_mask = mapping_gfp_mask(mapping);
-	if (mapping_cap_account_dirty(mapping))
-		gfp_mask |= __GFP_WRITE;
 	if (flags & AOP_FLAG_NOFS)
-		gfp_notmask = __GFP_FS;
-repeat:
-	page = find_lock_page(mapping, index);
+		fgp_flags |= FGP_NOFS;
+
+	page = pagecache_get_page(mapping, index, fgp_flags,
+			mapping_gfp_mask(mapping),
+			GFP_KERNEL);
 	if (page)
-		goto found;
+		wait_for_stable_page(page);
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
-	if (!page)
-		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index,
-						GFP_KERNEL & ~gfp_notmask);
-	if (unlikely(status)) {
-		page_cache_release(page);
-		if (status == -EEXIST)
-			goto repeat;
-		return NULL;
-	}
-found:
-	wait_for_stable_page(page);
 	return page;
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
@@ -2530,7 +2479,7 @@ again:
 
 		status = a_ops->write_begin(file, mapping, pos, bytes, flags,
 						&page, &fsdata);
-		if (unlikely(status))
+		if (unlikely(status < 0))
 			break;
 
 		if (mapping_writably_mapped(mapping))
@@ -2539,7 +2488,6 @@ again:
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		flush_dcache_page(page);
 
-		mark_page_accessed(page);
 		status = a_ops->write_end(file, mapping, pos, bytes, copied,
 						page, fsdata);
 		if (unlikely(status < 0))
diff --git a/mm/shmem.c b/mm/shmem.c
index f47fb38..700a4ad 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1372,9 +1372,13 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
 			struct page **pagep, void **fsdata)
 {
+	int ret;
 	struct inode *inode = mapping->host;
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+	if (*pagep)
+		init_page_accessed(*pagep);
+	return ret;
 }
 
 static int
diff --git a/mm/swap.c b/mm/swap.c
index 7a5bdd7..77baa36 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -583,6 +583,17 @@ void mark_page_accessed(struct page *page)
 EXPORT_SYMBOL(mark_page_accessed);
 
 /*
+ * Used to mark_page_accessed(page) that is not visible yet and when it is
+ * still safe to use non-atomic ops
+ */
+void init_page_accessed(struct page *page)
+{
+	if (!PageReferenced(page))
+		__SetPageReferenced(page);
+}
+EXPORT_SYMBOL(init_page_accessed);
+
+/*
  * Queue the page for addition to the LRU via pagevec. The decision on whether
  * to add the page to the [in]active [file|anon] list is deferred until the
  * pagevec is drained. This gives a chance for the caller of __lru_cache_add()
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH 19/19] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (17 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 18/19] mm: Non-atomically mark page accessed during page cache allocation where possible Mel Gorman
@ 2014-05-13  9:45 ` Mel Gorman
  2014-05-13 12:53   ` Mel Gorman
  2014-05-13 16:52   ` [PATCH 19/19] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath Peter Zijlstra
  2014-05-19  8:57 ` [PATCH] mm: Avoid unnecessary atomic operations during end_page_writeback Mel Gorman
  19 siblings, 2 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13  9:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Mel Gorman,
	Linux Kernel, Linux-MM, Linux-FSDevel

From: Nick Piggin <npiggin@suse.de>

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal that there are processes waiting on PG_locked, and
uses it to avoid memory barriers and the waitqueue hash lookup in the
unlock_page fastpath.

This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but the
primary motivation for this patch is large machines so I do not think that
is a concern.
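
In outline the fastpath looks something like the sketch below. This is a
simplification of the real patch; PageWaiters() and wake_up_page_waiters()
stand in for whatever accessors the full diff defines for the new flag.

	void unlock_page(struct page *page)
	{
		clear_bit_unlock(PG_locked, &page->flags);

		/*
		 * Only if a sleeper has announced itself via PG_waiters do
		 * we pay for the barrier, the page_waitqueue() hash lookup
		 * and the wakeup; the common case skips all of it.
		 */
		if (PageWaiters(page))
			wake_up_page_waiters(page);
	}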

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th physical memory to avoid dirty page balancing. In the async
case it is possible that the workload completes without even hitting the
disk and will have variable results, but it highlights the impact of
mark_page_accessed for async IO. The sync results are expected to be more
stable. The exception is tmpfs where the normal case is for the "IO" to
not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or NUMA
artifacts. Throughput and wall times are presented for sync IO; only wall
times are shown for async as the granularity reported by dd and the
variability make the throughput figures unsuitable for comparison. As async
results were variable due to writeback timings, I'm only reporting the
maximum figures. The sync results were stable enough to make the mean and
stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The kernels
being compared are "accessed-v2", which is the patch series up to this
patch, whereas "lockpage-v2" includes this patch.

async dd
                                   3.15.0-rc3            3.15.0-rc3
                                  accessed-v3           lockpage-v3
ext3   Max      elapsed     11.5900 (  0.00%)     11.0000 (  5.09%)
ext4   Max      elapsed     13.3400 (  0.00%)     13.4300 ( -0.67%)
tmpfs  Max      elapsed      0.4900 (  0.00%)      0.4800 (  2.04%)
btrfs  Max      elapsed     12.7800 (  0.00%)     13.8200 ( -8.14%)
xfs    Max      elapsed      2.0900 (  0.00%)      2.1100 ( -0.96%)

The xfs gain is the hardest to explain; it consistently manages to miss the
worst cases. In the other cases, the results are variable due to the async
nature of the test but the min and max figures are consistently better.

     samples percentage
ext3   90049     1.0238  vmlinux-3.15.0-rc4-accessed-v3 __wake_up_bit
ext3   61716     0.7017  vmlinux-3.15.0-rc4-accessed-v3 page_waitqueue
ext3   47529     0.5404  vmlinux-3.15.0-rc4-accessed-v3 unlock_page
ext3   23833     0.2710  vmlinux-3.15.0-rc4-accessed-v3 mark_page_accessed
ext3    9543     0.1085  vmlinux-3.15.0-rc4-accessed-v3 wake_up_bit
ext3    5036     0.0573  vmlinux-3.15.0-rc4-accessed-v3 init_page_accessed
ext3     369     0.0042  vmlinux-3.15.0-rc4-accessed-v3 __lock_page
ext3       1    1.1e-05  vmlinux-3.15.0-rc4-accessed-v3 lock_page
ext3   37376     0.4233  vmlinux-3.15.0-rc4-waitqueue-v3 unlock_page
ext3   11856     0.1343  vmlinux-3.15.0-rc4-waitqueue-v3 __wake_up_bit
ext3   11096     0.1257  vmlinux-3.15.0-rc4-waitqueue-v3 wake_up_bit
ext3     107     0.0012  vmlinux-3.15.0-rc4-waitqueue-v3 page_waitqueue
ext3      34    3.9e-04  vmlinux-3.15.0-rc4-waitqueue-v3 __lock_page
ext3       4    4.5e-05  vmlinux-3.15.0-rc4-waitqueue-v3 lock_page

There is a similar story told for each of the filesystems -- much less
time spent in page_waitqueue and __wake_up_bit because they now rarely need
to be called. Note that for workloads that contend heavily on the page
lock, unlock_page will *increase* in cost as it has to clear PG_waiters, so
while the typical case should be much faster, the worst case costs are now
higher.

The Intel vm-scalability tests tell a similar story. The ones measured here
are broadly based on dd of files 10 times the size of memory with one dd per
CPU in the system

                                              3.15.0-rc3            3.15.0-rc3
                                             accessed-v3           lockpage-v3
ext3  lru-file-readonce    elapsed      3.6300 (  0.00%)      3.6300 (  0.00%)
ext3 lru-file-readtwice    elapsed      6.0800 (  0.00%)      6.0700 (  0.16%)
ext4  lru-file-readonce    elapsed      3.7300 (  0.00%)      3.5400 (  5.09%)
ext4 lru-file-readtwice    elapsed      6.2400 (  0.00%)      6.0100 (  3.69%)
btrfs lru-file-readonce    elapsed      5.0100 (  0.00%)      4.9300 (  1.60%)
btrfslru-file-readtwice    elapsed      7.5800 (  0.00%)      7.6300 ( -0.66%)
xfs   lru-file-readonce    elapsed      3.7000 (  0.00%)      3.6400 (  1.62%)
xfs  lru-file-readtwice    elapsed      6.2400 (  0.00%)      5.8600 (  6.09%)

In most cases the time to read the file is slightly lowered. Unlike the
previous test there is no impact on mark_page_accessed as the pages are
already resident for this test and there is no opportunity to mark the
pages accessed without using atomic operations. Instead the profiles show
a reduction in the time spent in page_waitqueue.

This is similarly reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                       3.15.0-rc3            3.15.0-rc3
                      accessed-v2           lockpage-v2
Procs 107M     567.0000 (  0.00%)    542.0000 (  4.41%)
Procs 214M    1075.0000 (  0.00%)   1041.0000 (  3.16%)
Procs 322M    1918.0000 (  0.00%)   1522.0000 ( 20.65%)
Procs 429M    2063.0000 (  0.00%)   1950.0000 (  5.48%)
Procs 536M    2566.0000 (  0.00%)   2506.0000 (  2.34%)
Procs 644M    2920.0000 (  0.00%)   2804.0000 (  3.97%)
Procs 751M    3366.0000 (  0.00%)   3260.0000 (  3.15%)
Procs 859M    3800.0000 (  0.00%)   3672.0000 (  3.37%)
Procs 966M    4291.0000 (  0.00%)   4236.0000 (  1.28%)
Procs 1073M   4923.0000 (  0.00%)   4815.0000 (  2.19%)
Procs 1181M   5223.0000 (  0.00%)   5075.0000 (  2.83%)
Procs 1288M   5576.0000 (  0.00%)   5419.0000 (  2.82%)
Procs 1395M   5855.0000 (  0.00%)   5636.0000 (  3.74%)
Procs 1503M   6049.0000 (  0.00%)   5862.0000 (  3.09%)
Procs 1610M   6454.0000 (  0.00%)   6137.0000 (  4.91%)
Procs 1717M   6806.0000 (  0.00%)   6474.0000 (  4.88%)
Procs 1825M   7377.0000 (  0.00%)   6979.0000 (  5.40%)
Procs 1932M   7633.0000 (  0.00%)   7396.0000 (  3.10%)
Procs 2040M   8137.0000 (  0.00%)   7769.0000 (  4.52%)
Procs 2147M   8617.0000 (  0.00%)   8205.0000 (  4.78%)

         samples percentage
xfs        67544     1.1655  vmlinux-3.15.0-rc4-accessed-v3 unlock_page
xfs        49888     0.8609  vmlinux-3.15.0-rc4-accessed-v3 __wake_up_bit
xfs         1747     0.0301  vmlinux-3.15.0-rc4-accessed-v3 block_page_mkwrite
xfs         1578     0.0272  vmlinux-3.15.0-rc4-accessed-v3 wake_up_bit
xfs            2    3.5e-05  vmlinux-3.15.0-rc4-accessed-v3 lock_page
xfs        83010     1.3447  vmlinux-3.15.0-rc4-waitqueue-v3 unlock_page
xfs         2354     0.0381  vmlinux-3.15.0-rc4-waitqueue-v3 __wake_up_bit
xfs         2064     0.0334  vmlinux-3.15.0-rc4-waitqueue-v3 wake_up_bit
xfs           26    4.2e-04  vmlinux-3.15.0-rc4-waitqueue-v3 page_waitqueue
xfs            3    4.9e-05  vmlinux-3.15.0-rc4-waitqueue-v3 lock_page
xfs            2    3.2e-05  vmlinux-3.15.0-rc4-waitqueue-v3 __lock_page

[jack@suse.cz: Fix add_page_wait_queue]
[mhocko@suse.cz: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@sgi.com: Do not update struct page unnecessarily]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h |  18 +++++
 include/linux/pagemap.h    |   6 +-
 mm/filemap.c               | 178 ++++++++++++++++++++++++++++++++++++++++-----
 mm/page_alloc.c            |   1 +
 mm/swap.c                  |  10 +++
 mm/vmscan.c                |   3 +
 6 files changed, 196 insertions(+), 20 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 2093eb7..b2d0470 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+	TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -506,6 +523,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e5ffaa0..2ec2d78 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -485,13 +485,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
  * Never use this directly!
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
+extern void __wait_on_page_locked(struct page *page);
 
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern int __wait_on_page_locked_killable(struct page *page);
 
 static inline int wait_on_page_locked_killable(struct page *page)
 {
 	if (PageLocked(page))
-		return wait_on_page_bit_killable(page, PG_locked);
+		return __wait_on_page_locked_killable(page);
 	return 0;
 }
 
@@ -505,7 +507,7 @@ static inline int wait_on_page_locked_killable(struct page *page)
 static inline void wait_on_page_locked(struct page *page)
 {
 	if (PageLocked(page))
-		wait_on_page_bit(page, PG_locked);
+		__wait_on_page_locked(page);
 }
 
 /* 
diff --git a/mm/filemap.c b/mm/filemap.c
index c60ed0f..d81ed7d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -241,15 +241,15 @@ void delete_from_page_cache(struct page *page)
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
-static int sleep_on_page(void *word)
+static int sleep_on_page(void)
 {
-	io_schedule();
+	io_schedule_timeout(HZ);
 	return 0;
 }
 
-static int sleep_on_page_killable(void *word)
+static int sleep_on_page_killable(void)
 {
-	sleep_on_page(word);
+	sleep_on_page();
 	return fatal_signal_pending(current) ? -EINTR : 0;
 }
 
@@ -680,30 +680,105 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-static inline void wake_up_page(struct page *page, int bit)
+static inline wait_queue_head_t *clear_page_waiters(struct page *page)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	wait_queue_head_t *wqh = NULL;
+
+	if (!PageWaiters(page))
+		return NULL;
+
+	/*
+	 * Prepare to clear PG_waiters if the waitqueue is no longer
+	 * active. Note that there is no guarantee that a page with no
+	 * waiters will get cleared as there may be unrelated pages
+	 * sleeping on the same page wait queue. Accurate detection
+	 * would require a counter. In the event of a collision, the
+	 * waiter bit will dangle and lookups will be required until
+	 * the page is unlocked without collisions. The bit will need to
+	 * be cleared before freeing to avoid triggering debug checks.
+	 *
+	 * Furthermore, this can race with processes about to sleep on
+	 * the same page if it adds itself to the waitqueue just after
+	 * this check. The timeout in sleep_on_page prevents the race
+	 * being a terminal one. In effect, the uncontended and non-race
+	 * cases are faster in exchange for occasional worst case of the
+	 * timeout saving us.
+	 */
+	wqh = page_waitqueue(page);
+	if (!waitqueue_active(wqh))
+		ClearPageWaiters(page);
+
+	return wqh;
+}
+
+/* Returns true if the page is locked */
+static inline bool prepare_wait_lock(struct page *page, wait_queue_head_t *wqh,
+				wait_queue_t *wq, int state)
+{
+
+	/* Set PG_waiters so a racing unlock_page will check the waitqueue */
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
+
+	prepare_to_wait_exclusive(wqh, wq, state);
+
+	/*
+	 * A racing unlock can miss that the waitqueue is active and clear the
+	 * waiters again. This is not race free and cannot obviously be made
+	 * race free without introducing new locking. Instead, sleep_on_page()
+	 * has a timeout to catch the cases where the race does occur.
+	 */
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
+	return PageLocked(page);
+}
+
+static inline bool prepare_wait_bit(struct page *page, wait_queue_head_t *wqh,
+				wait_queue_t *wq, int state, int bit_nr)
+{
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
+
+	prepare_to_wait(wqh, wq, state);
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
+	return test_bit(bit_nr, &page->flags);
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
+	wait_queue_head_t *wqh;
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
-	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	if (!test_bit(bit_nr, &page->flags))
+		return;
+	wqh = page_waitqueue(page);
+
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_UNINTERRUPTIBLE, bit_nr))
+			sleep_on_page();
+	} while (test_bit(bit_nr, &page->flags));
+	finish_wait(wqh, &wait.wait);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
+	wait_queue_head_t *wqh;
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	int ret = 0;
 
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
+	wqh = page_waitqueue(page);
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
-			     sleep_on_page_killable, TASK_KILLABLE);
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr))
+			ret = sleep_on_page_killable();
+	} while (!ret && test_bit(bit_nr, &page->flags));
+	finish_wait(wqh, &wait.wait);
+
+	return ret;
 }
 
 /**
@@ -719,6 +794,8 @@ void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
 	unsigned long flags;
 
 	spin_lock_irqsave(&q->lock, flags);
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
 	__add_wait_queue(q, waiter);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
@@ -738,10 +815,26 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  */
 void unlock_page(struct page *page)
 {
+	wait_queue_head_t *wqh = clear_page_waiters(page);
+
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
+
+	/*
+	 * No additional barrier needed due to clear_bit_unlock barriering all updates
+	 * before waking waiters
+	 */
 	clear_bit_unlock(PG_locked, &page->flags);
-	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_locked);
+
+	/*
+	 * Wake the queue if waiters were detected. Ordinarily this wakeup
+	 * would be unconditional to catch races between the lock bit being
+	 * set and a new process joining the queue. However, that would
+	 * require the waitqueue to be looked up every time. Instead we
+	 * optimise for the uncontended and non-race case and recover using
+	 * a timeout in sleep_on_page.
+	 */
+	if (wqh)
+		__wake_up_bit(wqh, &page->flags, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -751,14 +844,19 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
+	wait_queue_head_t *wqh;
 	if (TestClearPageReclaim(page))
 		rotate_reclaimable_page(page);
 
 	if (!test_clear_page_writeback(page))
 		BUG();
 
+	wqh = clear_page_waiters(page);
+
 	smp_mb__after_clear_bit();
-	wake_up_page(page, PG_writeback);
+
+	if (wqh)
+		__wake_up_bit(wqh, &page->flags, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -768,22 +866,66 @@ EXPORT_SYMBOL(end_page_writeback);
  */
 void __lock_page(struct page *page)
 {
+	wait_queue_head_t *wqh = page_waitqueue(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	do {
+		if (prepare_wait_lock(page, wqh, &wait.wait, TASK_UNINTERRUPTIBLE))
+			sleep_on_page();
+	} while (!trylock_page(page));
+
+	finish_wait(wqh, &wait.wait);
 }
 EXPORT_SYMBOL(__lock_page);
 
 int __lock_page_killable(struct page *page)
 {
+	wait_queue_head_t *wqh = page_waitqueue(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	int ret = 0;
+
+	do {
+		if (prepare_wait_lock(page, wqh, &wait.wait, TASK_KILLABLE))
+			ret = sleep_on_page_killable();
+	} while (!ret && !trylock_page(page));
+
+	finish_wait(wqh, &wait.wait);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int  __wait_on_page_locked_killable(struct page *page)
+{
+	int ret = 0;
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	do {
+		if (prepare_wait_lock(page, wqh, &wait.wait, TASK_KILLABLE))
+			ret = sleep_on_page_killable();
+	} while (!ret && PageLocked(page));
+
+	finish_wait(wqh, &wait.wait);
+
+	return ret;
+}
+EXPORT_SYMBOL(__wait_on_page_locked_killable);
+
+void  __wait_on_page_locked(struct page *page)
+{
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	do {
+		if (prepare_wait_lock(page, wqh, &wait.wait, TASK_UNINTERRUPTIBLE))
+			sleep_on_page();
+	} while (PageLocked(page));
+
+	finish_wait(wqh, &wait.wait);
+}
+EXPORT_SYMBOL(__wait_on_page_locked);
+
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 02f3ffc..613cb4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6534,6 +6534,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
diff --git a/mm/swap.c b/mm/swap.c
index 77baa36..66d2077 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+
+	/* Clear dangling waiters from collisions on page_waitqueue */
+	__ClearPageWaiters(page);
+
 	free_hot_cold_page(page, false);
 }
 
@@ -867,6 +871,12 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
 
+		/*
+		 * Clear waiters bit that may still be set due to a collision
+		 * on page_waitqueue
+		 */
+		__ClearPageWaiters(page);
+
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8db1318..20250b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1090,6 +1090,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * waiting on the page lock, because there are no references.
 		 */
 		__clear_page_locked(page);
+		__ClearPageWaiters(page);
 free_it:
 		nr_reclaimed++;
 
@@ -1421,6 +1422,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
@@ -1629,6 +1631,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
-- 
1.8.4.5


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-13  9:45 ` [PATCH 04/19] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets Mel Gorman
@ 2014-05-13 10:58   ` Peter Zijlstra
  2014-05-13 12:28     ` Mel Gorman
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-13 10:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

[-- Attachment #1: Type: text/plain, Size: 1342 bytes --]

On Tue, May 13, 2014 at 10:45:35AM +0100, Mel Gorman wrote:
> +#ifdef HAVE_JUMP_LABEL
> +extern struct static_key cpusets_enabled_key;
> +static inline bool cpusets_enabled(void)
> +{
> +	return static_key_false(&cpusets_enabled_key);
> +}
> +
> +/* jump label reference count + the top-level cpuset */
> +#define number_of_cpusets (static_key_count(&cpusets_enabled_key) + 1)
> +
> +static inline void cpuset_inc(void)
> +{
> +	static_key_slow_inc(&cpusets_enabled_key);
> +}
> +
> +static inline void cpuset_dec(void)
> +{
> +	static_key_slow_dec(&cpusets_enabled_key);
> +}
> +
> +static inline void cpuset_init_count(void) { }
> +
> +#else
>  extern int number_of_cpusets;	/* How many cpusets are defined in system? */
>  
> +static inline bool cpusets_enabled(void)
> +{
> +	return number_of_cpusets > 1;
> +}
> +
> +static inline void cpuset_inc(void)
> +{
> +	number_of_cpusets++;
> +}
> +
> +static inline void cpuset_dec(void)
> +{
> +	number_of_cpusets--;
> +}
> +
> +static inline void cpuset_init_count(void)
> +{
> +	number_of_cpusets = 1;
> +}
> +#endif /* HAVE_JUMP_LABEL */

I'm still puzzled by the whole #else branch here, why not
unconditionally use the jump-label one? Without HAVE_JUMP_LABEL we'll
revert to a simple atomic_t counter, which should be perfectly fine, no?
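
As a rough illustration of the fallback being suggested, the user-space
sketch below shows the shape of a "simple atomic counter" key. The names
(key_count, key_enabled, key_inc, key_dec) are invented and this is not
the kernel's static_key implementation; without jump label patching the
enabled check degrades to a plain atomic read.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int key_count;

static inline bool key_enabled(void)
{
	/* No code patching available: just read the counter. */
	return atomic_load_explicit(&key_count, memory_order_relaxed) > 0;
}

static inline void key_inc(void) { atomic_fetch_add(&key_count, 1); }
static inline void key_dec(void) { atomic_fetch_sub(&key_count, 1); }

int main(void)
{
	printf("enabled: %d\n", key_enabled());	/* 0: no users yet */
	key_inc();				/* e.g. first cpuset created */
	printf("enabled: %d\n", key_enabled());	/* 1 */
	key_dec();
	return 0;
}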

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13  9:45 ` [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers Mel Gorman
@ 2014-05-13 11:09   ` Peter Zijlstra
  2014-05-13 12:50     ` Mel Gorman
  2014-05-13 13:50   ` Jan Kara
  2014-05-13 22:29   ` Andrew Morton
  2 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-13 11:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

[-- Attachment #1: Type: text/plain, Size: 2024 bytes --]

On Tue, May 13, 2014 at 10:45:48AM +0100, Mel Gorman wrote:
> Discarding buffers uses a bunch of atomic operations when discarding buffers
> because ...... I can't think of a reason. Use a cmpxchg loop to clear all the
> necessary flags. In most (all?) cases this will be a single atomic operations.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  fs/buffer.c                 | 14 +++++++++-----
>  include/linux/buffer_head.h |  5 +++++
>  2 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 9ddb9fc..e80012d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1485,14 +1485,18 @@ EXPORT_SYMBOL(set_bh_page);
>   */
>  static void discard_buffer(struct buffer_head * bh)
>  {
> +	unsigned long b_state, b_state_old;
> +
>  	lock_buffer(bh);
>  	clear_buffer_dirty(bh);
>  	bh->b_bdev = NULL;
> -	clear_buffer_mapped(bh);
> -	clear_buffer_req(bh);
> -	clear_buffer_new(bh);
> -	clear_buffer_delay(bh);
> -	clear_buffer_unwritten(bh);
> +	b_state = bh->b_state;
> +	for (;;) {
> +		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> +		if (b_state_old == b_state)
> +			break;
> +		b_state = b_state_old;
> +	}
>  	unlock_buffer(bh);
>  }

So.. I'm soon going to introduce atomic_{or,and}() and
atomic64_{or,and}() across the board, but of course this isn't an
atomic_long_t but a regular unsigned long.

Its a bit unfortunate we have this discrepancy with types vs atomic ops,
there's:

  cmpxchg, xchg -- mostly available for all 1,2,3,4 (and 8 where
  appropriate) byte values.

  bitops -- operate on unsigned long *

  atomic* -- operate on atomic_*t


So while ideally we'd be able to use the unconditional atomic and
operation which is available on a lot of architectures, we'll be stuck
with a cmpxchg loop instead :/

*sigh*

Anyway, nothing wrong with this patch, however, you could, if you really
wanted to push things, also include BH_Lock in that clear :-)
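
As a self-contained illustration of the cmpxchg-style loop being
discussed, the C11 sketch below clears a mask of bits from a plain
word; clear_flags() and the mask value are invented and do not
correspond to the buffer_head bits. In C11 the whole loop collapses to
a single atomic_fetch_and(), which is exactly the kind of unconditional
AND that is missing here for a plain unsigned long.

#include <stdatomic.h>
#include <stdio.h>

#define FLAGS_DISCARD	0x3eUL	/* made-up mask, not the buffer_head bits */

static atomic_ulong b_state;

static void clear_flags(unsigned long mask)
{
	unsigned long old = atomic_load(&b_state);

	/* Retry until the compare-and-swap lands; usually a single attempt. */
	while (!atomic_compare_exchange_weak(&b_state, &old, old & ~mask))
		;	/* on failure 'old' is refreshed with the current value */
}

int main(void)
{
	atomic_store(&b_state, 0xffUL);
	clear_flags(FLAGS_DISCARD);
	printf("0x%lx\n", (unsigned long)atomic_load(&b_state));	/* 0xc1 */
	return 0;
}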

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 04/19] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets
  2014-05-13 10:58   ` Peter Zijlstra
@ 2014-05-13 12:28     ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13 12:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Tue, May 13, 2014 at 12:58:51PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 10:45:35AM +0100, Mel Gorman wrote:
> > +#ifdef HAVE_JUMP_LABEL
> > +extern struct static_key cpusets_enabled_key;
> > +static inline bool cpusets_enabled(void)
> > +{
> > +	return static_key_false(&cpusets_enabled_key);
> > +}
> > +
> > +/* jump label reference count + the top-level cpuset */
> > +#define number_of_cpusets (static_key_count(&cpusets_enabled_key) + 1)
> > +
> > +static inline void cpuset_inc(void)
> > +{
> > +	static_key_slow_inc(&cpusets_enabled_key);
> > +}
> > +
> > +static inline void cpuset_dec(void)
> > +{
> > +	static_key_slow_dec(&cpusets_enabled_key);
> > +}
> > +
> > +static inline void cpuset_init_count(void) { }
> > +
> > +#else
> >  extern int number_of_cpusets;	/* How many cpusets are defined in system? */
> >  
> > +static inline bool cpusets_enabled(void)
> > +{
> > +	return number_of_cpusets > 1;
> > +}
> > +
> > +static inline void cpuset_inc(void)
> > +{
> > +	number_of_cpusets++;
> > +}
> > +
> > +static inline void cpuset_dec(void)
> > +{
> > +	number_of_cpusets--;
> > +}
> > +
> > +static inline void cpuset_init_count(void)
> > +{
> > +	number_of_cpusets = 1;
> > +}
> > +#endif /* HAVE_JUMP_LABEL */
> 
> I'm still puzzled by the whole #else branch here, why not
> unconditionally use the jump-label one? Without HAVE_JUMP_LABEL we'll
> revert to a simple atomic_t counter, which should be perfectly fine, no?

No good reason -- the intent was to preserve the old behaviour if jump
labels were not available but there is no good reason for that. I'll delete
the alternative implementation, make number_of_cpusets an inline function
and move cpusets_enabled_key into the __read_mostly section. It's untested
but the patch now looks like

---8<---
mm: page_alloc: Use jump labels to avoid checking number_of_cpusets

If cpusets are not in use then we still check a global variable on every
page allocation. Use jump labels to avoid the overhead.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/cpuset.h | 28 +++++++++++++++++++++++++---
 kernel/cpuset.c        | 14 ++++----------
 mm/page_alloc.c        |  3 ++-
 3 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index b19d3dc..a94af76 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -15,7 +15,27 @@
 
 #ifdef CONFIG_CPUSETS
 
-extern int number_of_cpusets;	/* How many cpusets are defined in system? */
+extern struct static_key cpusets_enabled_key;
+static inline bool cpusets_enabled(void)
+{
+	return static_key_false(&cpusets_enabled_key);
+}
+
+static inline int nr_cpusets(void)
+{
+	/* jump label reference count + the top-level cpuset */
+	return static_key_count(&cpusets_enabled_key) + 1;
+}
+
+static inline void cpuset_inc(void)
+{
+	static_key_slow_inc(&cpusets_enabled_key);
+}
+
+static inline void cpuset_dec(void)
+{
+	static_key_slow_dec(&cpusets_enabled_key);
+}
 
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
@@ -32,13 +52,13 @@ extern int __cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask);
 
 static inline int cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
 {
-	return number_of_cpusets <= 1 ||
+	return nr_cpusets() <= 1 ||
 		__cpuset_node_allowed_softwall(node, gfp_mask);
 }
 
 static inline int cpuset_node_allowed_hardwall(int node, gfp_t gfp_mask)
 {
-	return number_of_cpusets <= 1 ||
+	return nr_cpusets() <= 1 ||
 		__cpuset_node_allowed_hardwall(node, gfp_mask);
 }
 
@@ -124,6 +144,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 #else /* !CONFIG_CPUSETS */
 
+static inline bool cpusets_enabled(void) { return false; }
+
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 3d54c41..1300178 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -61,12 +61,7 @@
 #include <linux/cgroup.h>
 #include <linux/wait.h>
 
-/*
- * Tracks how many cpusets are currently defined in system.
- * When there is only one cpuset (the root cpuset) we can
- * short circuit some hooks.
- */
-int number_of_cpusets __read_mostly;
+struct static_key cpusets_enabled_key __read_mostly = STATIC_KEY_INIT_FALSE;
 
 /* See "Frequency meter" comments, below. */
 
@@ -611,7 +606,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		goto done;
 	}
 
-	csa = kmalloc(number_of_cpusets * sizeof(cp), GFP_KERNEL);
+	csa = kmalloc(nr_cpusets() * sizeof(cp), GFP_KERNEL);
 	if (!csa)
 		goto done;
 	csn = 0;
@@ -1888,7 +1883,7 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
 
-	number_of_cpusets++;
+	cpuset_inc();
 
 	if (!test_bit(CGRP_CPUSET_CLONE_CHILDREN, &css->cgroup->flags))
 		goto out_unlock;
@@ -1939,7 +1934,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
 	if (is_sched_load_balance(cs))
 		update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
 
-	number_of_cpusets--;
+	cpuset_dec();
 	clear_bit(CS_ONLINE, &cs->flags);
 
 	mutex_unlock(&cpuset_mutex);
@@ -1992,7 +1987,6 @@ int __init cpuset_init(void)
 	if (!alloc_cpumask_var(&cpus_attach, GFP_KERNEL))
 		BUG();
 
-	number_of_cpusets = 1;
 	return 0;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c559e3..cb12b9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1930,7 +1930,8 @@ zonelist_scan:
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if ((alloc_flags & ALLOC_CPUSET) &&
+		if (cpusets_enabled() &&
+			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13 11:09   ` Peter Zijlstra
@ 2014-05-13 12:50     ` Mel Gorman
  2014-05-13 13:49       ` Jan Kara
  2014-05-13 14:01       ` Peter Zijlstra
  0 siblings, 2 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Tue, May 13, 2014 at 01:09:51PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 10:45:48AM +0100, Mel Gorman wrote:
> > Discarding buffers uses a bunch of atomic operations when discarding buffers
> > because ...... I can't think of a reason. Use a cmpxchg loop to clear all the
> > necessary flags. In most (all?) cases this will be a single atomic operations.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  fs/buffer.c                 | 14 +++++++++-----
> >  include/linux/buffer_head.h |  5 +++++
> >  2 files changed, 14 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 9ddb9fc..e80012d 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1485,14 +1485,18 @@ EXPORT_SYMBOL(set_bh_page);
> >   */
> >  static void discard_buffer(struct buffer_head * bh)
> >  {
> > +	unsigned long b_state, b_state_old;
> > +
> >  	lock_buffer(bh);
> >  	clear_buffer_dirty(bh);
> >  	bh->b_bdev = NULL;
> > -	clear_buffer_mapped(bh);
> > -	clear_buffer_req(bh);
> > -	clear_buffer_new(bh);
> > -	clear_buffer_delay(bh);
> > -	clear_buffer_unwritten(bh);
> > +	b_state = bh->b_state;
> > +	for (;;) {
> > +		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> > +		if (b_state_old == b_state)
> > +			break;
> > +		b_state = b_state_old;
> > +	}
> >  	unlock_buffer(bh);
> >  }
> 
> So.. I'm soon going to introduce atomic_{or,and}() and
> atomic64_{or,and}() across the board, but of course this isn't an
> atomic_long_t but a regular unsigned long.
> 
> Its a bit unfortunate we have this discrepancy with types vs atomic ops,
> there's:
> 
>   cmpxchg, xchg -- mostly available for all 1,2,3,4 (and 8 where
>   appropriate) byte values.
> 
>   bitops -- operate on unsigned long *
> 
>   atomic* -- operate on atomic_*t

I hit the same problem when dealing with the pageblock bitmap. I would have
preferred it to do an atomic_read() but the actual conversion to use
atomic_t for the map became a mess with little or no upside.

> 
> operation which is available on a lot of architectures, we'll be stuck
> with a cmpxchg loop instead :/
> 
> *sigh*
> 
> Anyway, nothing wrong with this patch, however, you could, if you really
> wanted to push things, also include BH_Lock in that clear :-)

That's a bold strategy Cotton.

Untested patch on top

---8<---
diff --git a/fs/buffer.c b/fs/buffer.c
index e80012d..42fcb6d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1490,6 +1490,8 @@ static void discard_buffer(struct buffer_head * bh)
 	lock_buffer(bh);
 	clear_buffer_dirty(bh);
 	bh->b_bdev = NULL;
+
+	smp_mb__before_clear_bit();
 	b_state = bh->b_state;
 	for (;;) {
 		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
@@ -1497,7 +1499,13 @@ static void discard_buffer(struct buffer_head * bh)
 			break;
 		b_state = b_state_old;
 	}
-	unlock_buffer(bh);
+
+	/*
+	 * BUFFER_FLAGS_DISCARD includes BH_Lock so it has been cleared and the
+	 * wake_up_bit is the last part of an unlock_buffer
+	 */
+	smp_mb__after_clear_bit();
+	wake_up_bit(&bh->b_state, BH_Lock);
 }
 
 /**
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 95f565a..523db58 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -80,7 +80,7 @@ struct buffer_head {
 /* Bits that are cleared during an invalidate */
 #define BUFFER_FLAGS_DISCARD \
 	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
-	 1 << BH_Delay | 1 << BH_Unwritten)
+	 1 << BH_Delay | 1 << BH_Unwritten | 1 << BH_Lock)
 
 /*
  * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13  9:45 ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Mel Gorman
@ 2014-05-13 12:53   ` Mel Gorman
  2014-05-13 14:17     ` Peter Zijlstra
  2014-05-13 16:52   ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Peter Zijlstra
  1 sibling, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-13 12:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
>  void unlock_page(struct page *page)
>  {
> +	wait_queue_head_t *wqh = clear_page_waiters(page);
> +
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +
> +	/*
> +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> +	 * before waking waiters
> +	 */
>  	clear_bit_unlock(PG_locked, &page->flags);
> -	smp_mb__after_clear_bit();
> -	wake_up_page(page, PG_locked);

This is wrong. The smp_mb__after_clear_bit() is still required to ensure
that the cleared bit is visible before the wakeup on all architectures.

---8<---
diff --git a/mm/filemap.c b/mm/filemap.c
index 6ac066e..028b5a1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -819,11 +819,8 @@ void unlock_page(struct page *page)
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 
-	/*
-	 * No additional barrier needed due to clear_bit_unlock barriering all updates
-	 * before waking waiters
-	 */
 	clear_bit_unlock(PG_locked, &page->flags);
+	smp_mb__after_clear_bit();
 
 	/*
 	 * Wake the queue if waiters were detected. Ordinarily this wakeup

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 10/19] mm: page_alloc: Reduce number of times page_to_pfn is called
  2014-05-13  9:45 ` [PATCH 10/19] mm: page_alloc: Reduce number of times page_to_pfn is called Mel Gorman
@ 2014-05-13 13:27   ` Vlastimil Babka
  2014-05-13 14:09     ` Mel Gorman
  0 siblings, 1 reply; 103+ messages in thread
From: Vlastimil Babka @ 2014-05-13 13:27 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Jan Kara, Michal Hocko, Hugh Dickins,
	Peter Zijlstra, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On 05/13/2014 11:45 AM, Mel Gorman wrote:
> In the free path we calculate page_to_pfn multiple times. Reduce that.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Just two comments.
I just don't like #define but I can live with that.
__free_one_page() is marked inline so presumably it would reuse the
page_to_pfn() result in its caller already. But it seems to me it's
quite large, so I wonder if it gets inlined anyway, and if the
attribute still makes sense...

> ---
>   include/linux/mmzone.h          |  9 +++++++--
>   include/linux/pageblock-flags.h | 33 +++++++++++++--------------------
>   mm/page_alloc.c                 | 34 +++++++++++++++++++---------------
>   3 files changed, 39 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 835aa3d..bd6f504 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -78,10 +78,15 @@ extern int page_group_by_mobility_disabled;
>   #define NR_MIGRATETYPE_BITS (PB_migrate_end - PB_migrate + 1)
>   #define MIGRATETYPE_MASK ((1UL << NR_MIGRATETYPE_BITS) - 1)
>
> -static inline int get_pageblock_migratetype(struct page *page)
> +#define get_pageblock_migratetype(page)					\
> +	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
> +			PB_migrate_end, MIGRATETYPE_MASK)
> +
> +static inline int get_pfnblock_migratetype(struct page *page, unsigned long pfn)
>   {
>   	BUILD_BUG_ON(PB_migrate_end - PB_migrate != 2);
> -	return get_pageblock_flags_mask(page, PB_migrate_end, MIGRATETYPE_MASK);
> +	return get_pfnblock_flags_mask(page, pfn, PB_migrate_end,
> +					MIGRATETYPE_MASK);
>   }
>
>   struct free_area {
> diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
> index c08730c..2baeee1 100644
> --- a/include/linux/pageblock-flags.h
> +++ b/include/linux/pageblock-flags.h
> @@ -65,33 +65,26 @@ extern int pageblock_order;
>   /* Forward declaration */
>   struct page;
>
> -unsigned long get_pageblock_flags_mask(struct page *page,
> +unsigned long get_pfnblock_flags_mask(struct page *page,
> +				unsigned long pfn,
>   				unsigned long end_bitidx,
>   				unsigned long mask);
> -void set_pageblock_flags_mask(struct page *page,
> +
> +void set_pfnblock_flags_mask(struct page *page,
>   				unsigned long flags,
> +				unsigned long pfn,
>   				unsigned long end_bitidx,
>   				unsigned long mask);
>
>   /* Declarations for getting and setting flags. See mm/page_alloc.c */
> -static inline unsigned long get_pageblock_flags_group(struct page *page,
> -					int start_bitidx, int end_bitidx)
> -{
> -	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> -	unsigned long mask = (1 << nr_flag_bits) - 1;
> -
> -	return get_pageblock_flags_mask(page, end_bitidx, mask);
> -}
> -
> -static inline void set_pageblock_flags_group(struct page *page,
> -					unsigned long flags,
> -					int start_bitidx, int end_bitidx)
> -{
> -	unsigned long nr_flag_bits = end_bitidx - start_bitidx + 1;
> -	unsigned long mask = (1 << nr_flag_bits) - 1;
> -
> -	set_pageblock_flags_mask(page, flags, end_bitidx, mask);
> -}
> +#define get_pageblock_flags_group(page, start_bitidx, end_bitidx) \
> +	get_pfnblock_flags_mask(page, page_to_pfn(page),		\
> +			end_bitidx,					\
> +			(1 << (end_bitidx - start_bitidx + 1)) - 1)
> +#define set_pageblock_flags_group(page, flags, start_bitidx, end_bitidx) \
> +	set_pfnblock_flags_mask(page, flags, page_to_pfn(page),		\
> +			end_bitidx,					\
> +			(1 << (end_bitidx - start_bitidx + 1)) - 1)
>
>   #ifdef CONFIG_COMPACTION
>   #define get_pageblock_skip(page) \
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b438eb7..3948f0a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -559,6 +559,7 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
>    */
>
>   static inline void __free_one_page(struct page *page,
> +		unsigned long pfn,
>   		struct zone *zone, unsigned int order,
>   		int migratetype)
>   {
> @@ -575,7 +576,7 @@ static inline void __free_one_page(struct page *page,
>
>   	VM_BUG_ON(migratetype == -1);
>
> -	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
> +	page_idx = pfn & ((1 << MAX_ORDER) - 1);
>
>   	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
>   	VM_BUG_ON_PAGE(bad_range(zone, page), page);
> @@ -710,7 +711,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>   			list_del(&page->lru);
>   			mt = get_freepage_migratetype(page);
>   			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
> -			__free_one_page(page, zone, 0, mt);
> +			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
>   			trace_mm_page_pcpu_drain(page, 0, mt);
>   			if (likely(!is_migrate_isolate_page(page))) {
>   				__mod_zone_page_state(zone, NR_FREE_PAGES, 1);
> @@ -722,13 +723,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>   	spin_unlock(&zone->lock);
>   }
>
> -static void free_one_page(struct zone *zone, struct page *page, int order,
> +static void free_one_page(struct zone *zone,
> +				struct page *page, unsigned long pfn,
> +				int order,
>   				int migratetype)
>   {
>   	spin_lock(&zone->lock);
>   	zone->pages_scanned = 0;
>
> -	__free_one_page(page, zone, order, migratetype);
> +	__free_one_page(page, pfn, zone, order, migratetype);
>   	if (unlikely(!is_migrate_isolate(migratetype)))
>   		__mod_zone_freepage_state(zone, 1 << order, migratetype);
>   	spin_unlock(&zone->lock);
> @@ -765,15 +768,16 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>   {
>   	unsigned long flags;
>   	int migratetype;
> +	unsigned long pfn = page_to_pfn(page);
>
>   	if (!free_pages_prepare(page, order))
>   		return;
>
>   	local_irq_save(flags);
>   	__count_vm_events(PGFREE, 1 << order);
> -	migratetype = get_pageblock_migratetype(page);
> +	migratetype = get_pfnblock_migratetype(page, pfn);
>   	set_freepage_migratetype(page, migratetype);
> -	free_one_page(page_zone(page), page, order, migratetype);
> +	free_one_page(page_zone(page), page, pfn, order, migratetype);
>   	local_irq_restore(flags);
>   }
>
> @@ -1376,12 +1380,13 @@ void free_hot_cold_page(struct page *page, int cold)
>   	struct zone *zone = page_zone(page);
>   	struct per_cpu_pages *pcp;
>   	unsigned long flags;
> +	unsigned long pfn = page_to_pfn(page);
>   	int migratetype;
>
>   	if (!free_pages_prepare(page, 0))
>   		return;
>
> -	migratetype = get_pageblock_migratetype(page);
> +	migratetype = get_pfnblock_migratetype(page, pfn);
>   	set_freepage_migratetype(page, migratetype);
>   	local_irq_save(flags);
>   	__count_vm_event(PGFREE);
> @@ -1395,7 +1400,7 @@ void free_hot_cold_page(struct page *page, int cold)
>   	 */
>   	if (migratetype >= MIGRATE_PCPTYPES) {
>   		if (unlikely(is_migrate_isolate(migratetype))) {
> -			free_one_page(zone, page, 0, migratetype);
> +			free_one_page(zone, page, pfn, 0, migratetype);
>   			goto out;
>   		}
>   		migratetype = MIGRATE_MOVABLE;
> @@ -6032,17 +6037,16 @@ static inline int pfn_to_bitidx(struct zone *zone, unsigned long pfn)
>    * @end_bitidx: The last bit of interest
>    * returns pageblock_bits flags
>    */
> -unsigned long get_pageblock_flags_mask(struct page *page,
> +unsigned long get_pfnblock_flags_mask(struct page *page, unsigned long pfn,
>   					unsigned long end_bitidx,
>   					unsigned long mask)
>   {
>   	struct zone *zone;
>   	unsigned long *bitmap;
> -	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long bitidx, word_bitidx;
>   	unsigned long word;
>
>   	zone = page_zone(page);
> -	pfn = page_to_pfn(page);
>   	bitmap = get_pageblock_bitmap(zone, pfn);
>   	bitidx = pfn_to_bitidx(zone, pfn);
>   	word_bitidx = bitidx / BITS_PER_LONG;
> @@ -6054,25 +6058,25 @@ unsigned long get_pageblock_flags_mask(struct page *page,
>   }
>
>   /**
> - * set_pageblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
> + * set_pfnblock_flags_mask - Set the requested group of flags for a pageblock_nr_pages block of pages
>    * @page: The page within the block of interest
>    * @start_bitidx: The first bit of interest
>    * @end_bitidx: The last bit of interest
>    * @flags: The flags to set
>    */
> -void set_pageblock_flags_mask(struct page *page, unsigned long flags,
> +void set_pfnblock_flags_mask(struct page *page, unsigned long flags,
> +					unsigned long pfn,
>   					unsigned long end_bitidx,
>   					unsigned long mask)
>   {
>   	struct zone *zone;
>   	unsigned long *bitmap;
> -	unsigned long pfn, bitidx, word_bitidx;
> +	unsigned long bitidx, word_bitidx;
>   	unsigned long old_word, word;
>
>   	BUILD_BUG_ON(NR_PAGEBLOCK_BITS != 4);
>
>   	zone = page_zone(page);
> -	pfn = page_to_pfn(page);
>   	bitmap = get_pageblock_bitmap(zone, pfn);
>   	bitidx = pfn_to_bitidx(zone, pfn);
>   	word_bitidx = bitidx / BITS_PER_LONG;
>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 11/19] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free
  2014-05-13  9:45 ` [PATCH 11/19] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
@ 2014-05-13 13:36   ` Vlastimil Babka
  2014-05-13 14:23     ` Mel Gorman
  0 siblings, 1 reply; 103+ messages in thread
From: Vlastimil Babka @ 2014-05-13 13:36 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Jan Kara, Michal Hocko, Hugh Dickins,
	Peter Zijlstra, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On 05/13/2014 11:45 AM, Mel Gorman wrote:
> get_pageblock_migratetype() is called during free with IRQs disabled. This
> is unnecessary and disables IRQs for longer than necessary.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>

With a comment below,

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>   mm/page_alloc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3948f0a..fcbf637 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -773,9 +773,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
>   	if (!free_pages_prepare(page, order))
>   		return;
>
> +	migratetype = get_pfnblock_migratetype(page, pfn);
>   	local_irq_save(flags);
>   	__count_vm_events(PGFREE, 1 << order);
> -	migratetype = get_pfnblock_migratetype(page, pfn);
>   	set_freepage_migratetype(page, migratetype);

The line above could be also outside disabled IRQ, no?

>   	free_one_page(page_zone(page), page, pfn, order, migratetype);
>   	local_irq_restore(flags);
>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13 12:50     ` Mel Gorman
@ 2014-05-13 13:49       ` Jan Kara
  2014-05-13 14:30         ` Mel Gorman
  2014-05-13 14:01       ` Peter Zijlstra
  1 sibling, 1 reply; 103+ messages in thread
From: Jan Kara @ 2014-05-13 13:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue 13-05-14 13:50:07, Mel Gorman wrote:
> On Tue, May 13, 2014 at 01:09:51PM +0200, Peter Zijlstra wrote:
> > On Tue, May 13, 2014 at 10:45:48AM +0100, Mel Gorman wrote:
> > > Discarding buffers uses a bunch of atomic operations when discarding buffers
> > > because ...... I can't think of a reason. Use a cmpxchg loop to clear all the
> > > necessary flags. In most (all?) cases this will be a single atomic operations.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > ---
> > >  fs/buffer.c                 | 14 +++++++++-----
> > >  include/linux/buffer_head.h |  5 +++++
> > >  2 files changed, 14 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index 9ddb9fc..e80012d 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -1485,14 +1485,18 @@ EXPORT_SYMBOL(set_bh_page);
> > >   */
> > >  static void discard_buffer(struct buffer_head * bh)
> > >  {
> > > +	unsigned long b_state, b_state_old;
> > > +
> > >  	lock_buffer(bh);
> > >  	clear_buffer_dirty(bh);
> > >  	bh->b_bdev = NULL;
> > > -	clear_buffer_mapped(bh);
> > > -	clear_buffer_req(bh);
> > > -	clear_buffer_new(bh);
> > > -	clear_buffer_delay(bh);
> > > -	clear_buffer_unwritten(bh);
> > > +	b_state = bh->b_state;
> > > +	for (;;) {
> > > +		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> > > +		if (b_state_old == b_state)
> > > +			break;
> > > +		b_state = b_state_old;
> > > +	}
> > >  	unlock_buffer(bh);
> > >  }
> > 
> > So.. I'm soon going to introduce atomic_{or,and}() and
> > atomic64_{or,and}() across the board, but of course this isn't an
> > atomic_long_t but a regular unsigned long.
> > 
> > Its a bit unfortunate we have this discrepancy with types vs atomic ops,
> > there's:
> > 
> >   cmpxchg, xchg -- mostly available for all 1,2,3,4 (and 8 where
> >   appropriate) byte values.
> > 
> >   bitops -- operate on unsigned long *
> > 
> >   atomic* -- operate on atomic_*t
> 
> I hit the same problem when dealing with pageblock bitmap. I would have
> preferred it to do an atomic_read() but the actual conversion to use
> atomic_t for the map became a mess with little or no upside.
> 
> > 
> > operation which is available on a lot of architectures, we'll be stuck
> > with a cmpxchg loop instead :/
> > 
> > *sigh*
> > 
> > Anyway, nothing wrong with this patch, however, you could, if you really
> > wanted to push things, also include BH_Lock in that clear :-)
> 
> That's a bold strategy Cotton.
> 
> Untested patch on top
  Although this looks correct, I have to say I prefer the explicit
unlock_buffer() unless this has a measurable benefit.

								Honza
 
> ---8<---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e80012d..42fcb6d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1490,6 +1490,8 @@ static void discard_buffer(struct buffer_head * bh)
>  	lock_buffer(bh);
>  	clear_buffer_dirty(bh);
>  	bh->b_bdev = NULL;
> +
> +	smp_mb__before_clear_bit();
>  	b_state = bh->b_state;
>  	for (;;) {
>  		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> @@ -1497,7 +1499,13 @@ static void discard_buffer(struct buffer_head * bh)
>  			break;
>  		b_state = b_state_old;
>  	}
> -	unlock_buffer(bh);
> +
> +	/*
> +	 * BUFFER_FLAGS_DISCARD includes BH_Lock so it has been cleared and the
> +	 * wake_up_bit is the last part of an unlock_buffer
> +	 */
> +	smp_mb__after_clear_bit();
> +	wake_up_bit(&bh->b_state, BH_Lock);
>  }
>  
>  /**
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 95f565a..523db58 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -80,7 +80,7 @@ struct buffer_head {
>  /* Bits that are cleared during an invalidate */
>  #define BUFFER_FLAGS_DISCARD \
>  	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
> -	 1 << BH_Delay | 1 << BH_Unwritten)
> +	 1 << BH_Delay | 1 << BH_Unwritten | 1 << BH_Lock)
>  
>  /*
>   * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13  9:45 ` [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers Mel Gorman
  2014-05-13 11:09   ` Peter Zijlstra
@ 2014-05-13 13:50   ` Jan Kara
  2014-05-13 22:29   ` Andrew Morton
  2 siblings, 0 replies; 103+ messages in thread
From: Jan Kara @ 2014-05-13 13:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Peter Zijlstra, Dave Hansen,
	Linux Kernel, Linux-MM, Linux-FSDevel

On Tue 13-05-14 10:45:48, Mel Gorman wrote:
> Discarding buffers uses a bunch of atomic operations when discarding buffers
> because ...... I can't think of a reason. Use a cmpxchg loop to clear all the
> necessary flags. In most (all?) cases this will be a single atomic operations.
  Looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  fs/buffer.c                 | 14 +++++++++-----
>  include/linux/buffer_head.h |  5 +++++
>  2 files changed, 14 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 9ddb9fc..e80012d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1485,14 +1485,18 @@ EXPORT_SYMBOL(set_bh_page);
>   */
>  static void discard_buffer(struct buffer_head * bh)
>  {
> +	unsigned long b_state, b_state_old;
> +
>  	lock_buffer(bh);
>  	clear_buffer_dirty(bh);
>  	bh->b_bdev = NULL;
> -	clear_buffer_mapped(bh);
> -	clear_buffer_req(bh);
> -	clear_buffer_new(bh);
> -	clear_buffer_delay(bh);
> -	clear_buffer_unwritten(bh);
> +	b_state = bh->b_state;
> +	for (;;) {
> +		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> +		if (b_state_old == b_state)
> +			break;
> +		b_state = b_state_old;
> +	}
>  	unlock_buffer(bh);
>  }
>  
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index c40302f..95f565a 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -77,6 +77,11 @@ struct buffer_head {
>  	atomic_t b_count;		/* users using this buffer_head */
>  };
>  
> +/* Bits that are cleared during an invalidate */
> +#define BUFFER_FLAGS_DISCARD \
> +	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
> +	 1 << BH_Delay | 1 << BH_Unwritten)
> +
>  /*
>   * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
>   * and buffer_foo() functions.
> -- 
> 1.8.4.5
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13 12:50     ` Mel Gorman
  2014-05-13 13:49       ` Jan Kara
@ 2014-05-13 14:01       ` Peter Zijlstra
  2014-05-13 14:46         ` Mel Gorman
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-13 14:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Tue, May 13, 2014 at 01:50:07PM +0100, Mel Gorman wrote:
> > Anyway, nothing wrong with this patch, however, you could, if you really
> > wanted to push things, also include BH_Lock in that clear :-)
> 
> That's a bold strategy Cotton.

:-)

> Untested patch on top
> 
> ---8<---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e80012d..42fcb6d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1490,6 +1490,8 @@ static void discard_buffer(struct buffer_head * bh)
>  	lock_buffer(bh);
>  	clear_buffer_dirty(bh);
>  	bh->b_bdev = NULL;
> +
> +	smp_mb__before_clear_bit();

Not needed.

>  	b_state = bh->b_state;
>  	for (;;) {
>  		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> @@ -1497,7 +1499,13 @@ static void discard_buffer(struct buffer_head * bh)
>  			break;
>  		b_state = b_state_old;
>  	}
> -	unlock_buffer(bh);
> +
> +	/*
> +	 * BUFFER_FLAGS_DISCARD include BH_lock so it has been cleared so the
> +	 * wake_up_bit is the last part of a unlock_buffer
> +	 */
> +	smp_mb__after_clear_bit();

Similarly superfluous.

> +	wake_up_bit(&bh->b_state, BH_Lock);
>  }

The thing is that cmpxchg() guarantees full barrier semantics before and
after the op, and since the loop guarantees at least one cmpxchg() call
it's all good.

Now just to confuse everyone, you could have written the loop like:

	b_state = bh->b_state;
	for (;;) {
		b_state_new = b_state & ~BUFFER_FLAGS_DISCARD;
		if (b_state == b_state_new)
			break;
		b_state = cmpxchg(&bh->b_state, b_state, b_state_new);
	}

Which is 'similar' but doesn't guarantee that cmpxchg() gets called.
If you expect the initial value to match the new state, the above form
is slightly faster, but the lack of barrier guarantees can still spoil
the fun.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 10/19] mm: page_alloc: Reduce number of times page_to_pfn is called
  2014-05-13 13:27   ` Vlastimil Babka
@ 2014-05-13 14:09     ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13 14:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, May 13, 2014 at 03:27:29PM +0200, Vlastimil Babka wrote:
> On 05/13/2014 11:45 AM, Mel Gorman wrote:
> >In the free path we calculate page_to_pfn multiple times. Reduce that.
> >
> >Signed-off-by: Mel Gorman <mgorman@suse.de>
> >Acked-by: Rik van Riel <riel@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Just two comments.
> I just don't like #define but I can live with that.

page_to_pfn is not available in that context due to header dependency
problems. It can be avoided by moving the two functions into mm/internal.h
so I'll do that. I cannot see why code outside of mm/ would be messing
with those bits anyway.

Thanks

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 12:53   ` Mel Gorman
@ 2014-05-13 14:17     ` Peter Zijlstra
  2014-05-13 15:27       ` Paul E. McKenney
  2014-05-14 16:11       ` Oleg Nesterov
  0 siblings, 2 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-13 14:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel, Oleg Nesterov, Paul McKenney, Linus Torvalds,
	David Howells

On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> >  void unlock_page(struct page *page)
> >  {
> > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > +
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > +
> > +	/*
> > +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> > +	 * before waking waiters
> > +	 */
> >  	clear_bit_unlock(PG_locked, &page->flags);
> > -	smp_mb__after_clear_bit();
> > -	wake_up_page(page, PG_locked);
> 
> This is wrong. The smp_mb__after_clear_bit() is still required to ensure
> that the cleared bit is visible before the wakeup on all architectures.

wakeup implies a mb, and I just noticed that our Documentation is
'obsolete' and only mentions it implies a wmb.

Also, if you're going to use smp_mb__after_atomic() you can use
clear_bit() and not use clear_bit_unlock().



---
Subject: doc: Update wakeup barrier documentation

As per commit e0acd0a68ec7 ("sched: fix the theoretical signal_wake_up()
vs schedule() race") both wakeup and schedule now imply a full barrier.

Furthermore, the barrier is unconditional when calling try_to_wake_up()
and has been for a fair while.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/memory-barriers.txt | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 46412bded104..dae5158c2382 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
 	event_indicated = 1;
 	wake_up_process(event_daemon);
 
-A write memory barrier is implied by wake_up() and co. if and only if they wake
-something up.  The barrier occurs before the task state is cleared, and so sits
-between the STORE to indicate the event and the STORE to set TASK_RUNNING:
+A full memory barrier is implied by wake_up() and co. The barrier occurs
+before the task state is cleared, and so sits between the STORE to indicate
+the event and the STORE to set TASK_RUNNING:
 
 	CPU 1				CPU 2
 	===============================	===============================


* Re: [PATCH 11/19] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free
  2014-05-13 13:36   ` Vlastimil Babka
@ 2014-05-13 14:23     ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13 14:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, May 13, 2014 at 03:36:25PM +0200, Vlastimil Babka wrote:
> On 05/13/2014 11:45 AM, Mel Gorman wrote:
> >get_pageblock_migratetype() is called during free with IRQs disabled. This
> >is unnecessary and disables IRQs for longer than necessary.
> >
> >Signed-off-by: Mel Gorman <mgorman@suse.de>
> >Acked-by: Rik van Riel <riel@redhat.com>
> 
> With a comment below,
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 

Thanks

> >---
> >  mm/page_alloc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 3948f0a..fcbf637 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -773,9 +773,9 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> >  	if (!free_pages_prepare(page, order))
> >  		return;
> >
> >+	migratetype = get_pfnblock_migratetype(page, pfn);
> >  	local_irq_save(flags);
> >  	__count_vm_events(PGFREE, 1 << order);
> >-	migratetype = get_pfnblock_migratetype(page, pfn);
> >  	set_freepage_migratetype(page, migratetype);
> 
> The line above could be also outside disabled IRQ, no?
> 

I guess it could but the difference would be marginal at
best. get_pfnblock_migratetype is a lookup of the pageblock bitfield and is
an expensive operation. set_freepage_migratetype() on the other hand is just

static inline void set_freepage_migratetype(struct page *page, int migratetype)
{
        page->index = migratetype;
}

If anything, the line could just be removed as nothing below that level
is actually using the information right now (it's primarily of interest
in the per-cpu allocator), but that would be outside the scope of this
patch as move_freepages would also need addressing. I feel the gain is
too marginal to justify the churn.

> >  	free_one_page(page_zone(page), page, pfn, order, migratetype);
> >  	local_irq_restore(flags);
> >
> 
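
For reference, this is roughly how the ordering in __free_pages_ok() ends
up with the patch applied (a reconstruction from the hunk above, other
details omitted):

static void __free_pages_ok(struct page *page, unsigned int order)
{
	unsigned long flags;
	unsigned long pfn = page_to_pfn(page);
	int migratetype;

	if (!free_pages_prepare(page, order))
		return;

	/* the expensive pageblock bitmap lookup now runs with IRQs enabled */
	migratetype = get_pfnblock_migratetype(page, pfn);

	local_irq_save(flags);
	__count_vm_events(PGFREE, 1 << order);
	/* cheap: just a store to page->index */
	set_freepage_migratetype(page, migratetype);
	free_one_page(page_zone(page), page, pfn, order, migratetype);
	local_irq_restore(flags);
}

so only the cheap store and the actual free remain inside the IRQ-disabled
section.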

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 18/19] mm: Non-atomically mark page accessed during page cache allocation where possible
  2014-05-13  9:45 ` [PATCH 18/19] mm: Non-atomically mark page accessed during page cache allocation where possible Mel Gorman
@ 2014-05-13 14:29   ` Theodore Ts'o
  2014-05-20 15:49   ` [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Theodore Ts'o @ 2014-05-13 14:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Peter Zijlstra, Dave Hansen,
	Linux Kernel, Linux-MM, Linux-FSDevel

Acked-by: "Theodore Ts'o" <tytso@mit.edu>

Thanks!!

				- Ted


* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13 13:49       ` Jan Kara
@ 2014-05-13 14:30         ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13 14:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Tue, May 13, 2014 at 03:49:43PM +0200, Jan Kara wrote:
> > > operation which is available on a lot of architectures, we'll be stuck
> > > with a cmpxchg loop instead :/
> > > 
> > > *sigh*
> > > 
> > > Anyway, nothing wrong with this patch, however, you could, if you really
> > > wanted to push things, also include BH_Lock in that clear :-)
> > 
> > That's a bold strategy Cotton.
> > 
> > Untested patch on top
>   Although this looks correct, I have to say I prefer the explicit
> unlock_buffer() unless this has a measurable benefit.
> 

I will keep this as a separate patch, move it to the end of the series
and check what the profiles look like. Thanks.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13 14:01       ` Peter Zijlstra
@ 2014-05-13 14:46         ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-13 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Tue, May 13, 2014 at 04:01:27PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 01:50:07PM +0100, Mel Gorman wrote:
> > > Anyway, nothing wrong with this patch, however, you could, if you really
> > > wanted to push things, also include BH_Lock in that clear :-)
> > 
> > That's a bold strategy Cotton.
> 
> :-)
> 
> > Untested patch on top
> > 
> > ---8<---
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index e80012d..42fcb6d 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1490,6 +1490,8 @@ static void discard_buffer(struct buffer_head * bh)
> >  	lock_buffer(bh);
> >  	clear_buffer_dirty(bh);
> >  	bh->b_bdev = NULL;
> > +
> > +	smp_mb__before_clear_bit();
> 
> Not needed.
> 
> >  	b_state = bh->b_state;
> >  	for (;;) {
> >  		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> > @@ -1497,7 +1499,13 @@ static void discard_buffer(struct buffer_head * bh)
> >  			break;
> >  		b_state = b_state_old;
> >  	}
> > -	unlock_buffer(bh);
> > +
> > +	/*
> > +	 * BUFFER_FLAGS_DISCARD include BH_lock so it has been cleared so the
> > +	 * wake_up_bit is the last part of a unlock_buffer
> > +	 */
> > +	smp_mb__after_clear_bit();
> 
> Similarly superfluous.
> 
> > +	wake_up_bit(&bh->b_state, BH_Lock);
> >  }
> 
> The thing is that cmpxchg() guarantees full barrier semantics before and
> after the op, and since the loop guarantees at least one cmpxchg() call
> its all good.
> 

Of course, thanks for pointing that out. I was only thinking of it in
terms of it being a clear_bit operation which was dumb.

> Now just to confuse everyone, you could have written the loop like:
> 
> 	b_state = bh->b_state;
> 	for (;;) {
> 		b_state_new = b_state & ~BUFFER_FLAGS_DISCARD;
> 		if (b_state == b_state_new)
> 			break;
> 		b_state = cmpxchg(&bh->b_state, b_state, b_state_new);
> 	}
> 
> Which is 'similar' but doesn't guarantee that cmpxchg() gets called.
> If you expect the initial value to match the new state, the above form
> is slightly faster, but the lack of barrier guarantees can still spoil
> the fun.

I do not really expect the initial value to match the new state. At the
very least I would expect BH_mapped to be routinely cleared during this
operation so I doubt it's worth the effort trying to deal with
conditional buffers.
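
For anyone reading along later, the whole thing boils down to a single
compare-and-swap retried until it lands. As a standalone userspace C11
illustration (not the fs/buffer.c code, and with made-up flag values):

#include <stdatomic.h>
#include <stdio.h>

/* stand-in flag bits, not the real BH_* values */
#define BH_MAPPED	(1UL << 0)
#define BH_NEW		(1UL << 1)
#define BH_LOCK		(1UL << 2)
#define FLAGS_DISCARD	(BH_MAPPED | BH_NEW)

static _Atomic unsigned long b_state = BH_MAPPED | BH_NEW | BH_LOCK;

int main(void)
{
	unsigned long old = atomic_load(&b_state);

	/*
	 * Retry until the compare-and-swap lands. At least one CAS always
	 * executes, which is where the full barrier semantics come from.
	 */
	while (!atomic_compare_exchange_weak(&b_state, &old,
					     old & ~FLAGS_DISCARD))
		;

	printf("b_state = %#lx\n", atomic_load(&b_state));
	return 0;
}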

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 14:17     ` Peter Zijlstra
@ 2014-05-13 15:27       ` Paul E. McKenney
  2014-05-13 15:44         ` Peter Zijlstra
  2014-05-13 18:18         ` Oleg Nesterov
  2014-05-14 16:11       ` Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Paul E. McKenney @ 2014-05-13 15:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Oleg Nesterov, Linus Torvalds,
	David Howells

On Tue, May 13, 2014 at 04:17:48PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> > On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > >  void unlock_page(struct page *page)
> > >  {
> > > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > > +
> > >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +
> > > +	/*
> > > +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> > > +	 * before waking waiters
> > > +	 */
> > >  	clear_bit_unlock(PG_locked, &page->flags);
> > > -	smp_mb__after_clear_bit();
> > > -	wake_up_page(page, PG_locked);
> > 
> > This is wrong. The smp_mb__after_clear_bit() is still required to ensure
> > that the cleared bit is visible before the wakeup on all architectures.
> 
> wakeup implies a mb, and I just noticed that our Documentation is
> 'obsolete' and only mentions it implies a wmb.
> 
> Also, if you're going to use smp_mb__after_atomic() you can use
> clear_bit() and not use clear_bit_unlock().
> 
> 
> 
> ---
> Subject: doc: Update wakeup barrier documentation
> 
> As per commit e0acd0a68ec7 ("sched: fix the theoretical signal_wake_up()
> vs schedule() race") both wakeup and schedule now imply a full barrier.
> 
> Furthermore, the barrier is unconditional when calling try_to_wake_up()
> and has been for a fair while.
> 
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Some questions below.

							Thanx, Paul

> ---
>  Documentation/memory-barriers.txt | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 46412bded104..dae5158c2382 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
>  	event_indicated = 1;
>  	wake_up_process(event_daemon);
> 
> -A write memory barrier is implied by wake_up() and co. if and only if they wake
> -something up.  The barrier occurs before the task state is cleared, and so sits
> -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> +A full memory barrier is implied by wake_up() and co. The barrier occurs

Last I checked, the memory barrier was guaranteed only if a wakeup
actually occurred.  If there is a sleep-wakeup race, for example,
between wait_event_interruptible() and wake_up(), then it looks to me
that the following can happen:

o	Task A invokes wait_event_interruptible(), waiting for
	X==1.

o	Before Task A gets anywhere, Task B sets Y=1, does
	smp_mb(), then sets X=1.

o	Task B invokes wake_up(), which invokes __wake_up(), which
	acquires the wait_queue_head_t's lock and invokes
	__wake_up_common(), which sees nothing to wake up.

o	Task A tests the condition, finds X==1, and returns without
	locks, memory barriers, atomic instructions, or anything else
	that would guarantee ordering.

o	Task A then loads from Y.  Because there have been no memory
	barriers, it might well see Y==0.

So what am I missing here?
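
Modeled as a standalone userspace program (C11 atomics and threads standing
in for the kernel primitives, so a rough sketch only), the pattern is:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;		/* x: the wait_event() condition, y: other data */

static void *task_b(void *arg)	/* the waker */
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* Task B's smp_mb() */
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	/* wake_up() finds nothing on the waitqueue, so adds no ordering for A */
	return NULL;
}

static void *task_a(void *arg)	/* the fast-path "waiter" */
{
	(void)arg;
	/* wait_event() fast path: condition already true, no sleep, no barrier */
	while (!atomic_load_explicit(&x, memory_order_relaxed))
		;
	/* nothing orders this load against the loads of x above */
	printf("y = %d\n", atomic_load_explicit(&y, memory_order_relaxed));
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, task_a, NULL);
	pthread_create(&b, NULL, task_b, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

"y = 0" is a permitted outcome because only Task B executes a barrier and
barriers need to be paired.  (On x86 you won't see it thanks to TSO, but
nothing above rules it out on weakly ordered hardware.)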

On the other hand, if a wake_up() really does happen, then
the fast-path out of wait_event_interruptible() is not taken,
and __wait_event_interruptible() is called instead.  This calls
___wait_event(), which eventually calls prepare_to_wait_event(), which
in turn calls set_current_state(), which calls set_mb(), which does a
full memory barrier.  And if that isn't good enough, there is the
call to schedule() itself.  ;-)

So if a wait actually sleeps, it does imply a full memory barrier
several times over.

On the wake_up() side, wake_up() calls __wake_up(), which as mentioned
earlier calls __wake_up_common() under a lock.  This invokes the
wake-up function stored by the sleeping task, for example,
autoremove_wake_function(), which calls default_wake_function(),
which invokes try_to_wake_up(), which does smp_mb__before_spinlock()
before acquiring the to-be-waked task's PI lock.

The definition of smp_mb__before_spinlock() is smp_wmb().  There is
also an smp_rmb() in try_to_wake_up(), which still does not get us
to a full memory barrier.  It also calls select_task_rq(), which
does not seem to guarantee any particular memory ordering (but
I could easily have missed something).  It also calls ttwu_queue(),
which invokes ttwu_do_activate() under the RQ lock.  I don't see a
full memory barrier in ttwu_do_activate(), but again could easily
have missed one.  Ditto for ttwu_stat().

All the locks nest, so other than the smp_wmb() and smp_rmb(), things
could bleed in.

> +before the task state is cleared, and so sits between the STORE to indicate
> +the event and the STORE to set TASK_RUNNING:

If I am in fact correct, and if we really want to advertise the read
memory barrier, I suggest the following replacement text:

	A read and a write memory barrier (-not- a full memory barrier)
	are implied by wake_up() and co. if and only if they wake
	something up.  The write barrier occurs before the task state is
	cleared, and so sits between the STORE to indicate the event and
	the STORE to set TASK_RUNNING, and the read barrier after that:

	CPU 1				CPU 2
	===============================	===============================
	set_current_state();		STORE event_indicated
	  set_mb();			wake_up();
	    STORE current->state	  <write barrier>
	    <general barrier>		  STORE current->state
	LOAD event_indicated		  <read barrier>



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 15:27       ` Paul E. McKenney
@ 2014-05-13 15:44         ` Peter Zijlstra
  2014-05-13 16:14           ` Paul E. McKenney
  2014-05-13 18:22           ` Oleg Nesterov
  2014-05-13 18:18         ` Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-13 15:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Oleg Nesterov, Linus Torvalds,
	David Howells

On Tue, May 13, 2014 at 08:27:19AM -0700, Paul E. McKenney wrote:
> > Subject: doc: Update wakeup barrier documentation
> > 
> > As per commit e0acd0a68ec7 ("sched: fix the theoretical signal_wake_up()
> > vs schedule() race") both wakeup and schedule now imply a full barrier.
> > 
> > Furthermore, the barrier is unconditional when calling try_to_wake_up()
> > and has been for a fair while.
> > 
> > Cc: Oleg Nesterov <oleg@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: David Howells <dhowells@redhat.com>
> > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> 
> Some questions below.
> 
> 							Thanx, Paul
> 
> > ---
> >  Documentation/memory-barriers.txt | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 46412bded104..dae5158c2382 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
> >  	event_indicated = 1;
> >  	wake_up_process(event_daemon);
> > 
> > -A write memory barrier is implied by wake_up() and co. if and only if they wake
> > -something up.  The barrier occurs before the task state is cleared, and so sits
> > -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> > +A full memory barrier is implied by wake_up() and co. The barrier occurs
> 
> Last I checked, the memory barrier was guaranteed only if a wakeup
> actually occurred.  If there is a sleep-wakeup race, for example,
> between wait_event_interruptible() and wake_up(), then it looks to me
> that the following can happen:
> 
> o	Task A invokes wait_event_interruptible(), waiting for
> 	X==1.
> 
> o	Before Task A gets anywhere, Task B sets Y=1, does
> 	smp_mb(), then sets X=1.
> 
> o	Task B invokes wake_up(), which invokes __wake_up(), which
> 	acquires the wait_queue_head_t's lock and invokes
> 	__wake_up_common(), which sees nothing to wake up.
> 
> o	Task A tests the condition, finds X==1, and returns without
> 	locks, memory barriers, atomic instructions, or anything else
> 	that would guarantee ordering.
> 
> o	Task A then loads from Y.  Because there have been no memory
> 	barriers, it might well see Y==0.
> 
> So what am I missing here?

Ah, that's what was meant :-) The way I read it was that
wake_up_process() would only imply the barrier if the task actually got
a wakeup (ie. the return value is 1).

But yes, this makes a lot more sense. Sorry for the confusion.

> On the wake_up() side, wake_up() calls __wake_up(), which as mentioned
> earlier calls __wake_up_common() under a lock.  This invokes the
> wake-up function stored by the sleeping task, for example,
> autoremove_wake_function(), which calls default_wake_function(),
> which invokes try_to_wake_up(), which does smp_mb__before_spinlock()
> before acquiring the to-be-waked task's PI lock.
> 
> The definition of smp_mb__before_spinlock() is smp_wmb().  There is
> also an smp_rmb() in try_to_wake_up(), which still does not get us
> to a full memory barrier.  It also calls select_task_rq(), which
> does not seem to guarantee any particular memory ordering (but
> I could easily have missed something).  It also calls ttwu_queue(),
> which invokes ttwu_do_activate() under the RQ lock.  I don't see a
> full memory barrier in ttwu_do_activate(), but again could easily
> have missed one.  Ditto for ttwu_stat().

Ah, yes, so I'll defer to Oleg and Linus to explain that one. As per the
name: smp_mb__before_spinlock() should of course imply a full barrier.


* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 15:44         ` Peter Zijlstra
@ 2014-05-13 16:14           ` Paul E. McKenney
  2014-05-13 18:57             ` Oleg Nesterov
  2014-05-13 18:22           ` Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2014-05-13 16:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Oleg Nesterov, Linus Torvalds,
	David Howells

On Tue, May 13, 2014 at 05:44:35PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 08:27:19AM -0700, Paul E. McKenney wrote:
> > > Subject: doc: Update wakeup barrier documentation
> > > 
> > > As per commit e0acd0a68ec7 ("sched: fix the theoretical signal_wake_up()
> > > vs schedule() race") both wakeup and schedule now imply a full barrier.
> > > 
> > > Furthermore, the barrier is unconditional when calling try_to_wake_up()
> > > and has been for a fair while.
> > > 
> > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: David Howells <dhowells@redhat.com>
> > > Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > 
> > Some questions below.
> > 
> > 							Thanx, Paul
> > 
> > > ---
> > >  Documentation/memory-barriers.txt | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > index 46412bded104..dae5158c2382 100644
> > > --- a/Documentation/memory-barriers.txt
> > > +++ b/Documentation/memory-barriers.txt
> > > @@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
> > >  	event_indicated = 1;
> > >  	wake_up_process(event_daemon);
> > > 
> > > -A write memory barrier is implied by wake_up() and co. if and only if they wake
> > > -something up.  The barrier occurs before the task state is cleared, and so sits
> > > -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> > > +A full memory barrier is implied by wake_up() and co. The barrier occurs
> > 
> > Last I checked, the memory barrier was guaranteed only if a wakeup
> > actually occurred.  If there is a sleep-wakeup race, for example,
> > between wait_event_interruptible() and wake_up(), then it looks to me
> > that the following can happen:
> > 
> > o	Task A invokes wait_event_interruptible(), waiting for
> > 	X==1.
> > 
> > o	Before Task A gets anywhere, Task B sets Y=1, does
> > 	smp_mb(), then sets X=1.
> > 
> > o	Task B invokes wake_up(), which invokes __wake_up(), which
> > 	acquires the wait_queue_head_t's lock and invokes
> > 	__wake_up_common(), which sees nothing to wake up.
> > 
> > o	Task A tests the condition, finds X==1, and returns without
> > 	locks, memory barriers, atomic instructions, or anything else
> > 	that would guarantee ordering.
> > 
> > o	Task A then loads from Y.  Because there have been no memory
> > 	barriers, it might well see Y==0.
> > 
> > So what am I missing here?
> 
> Ah, that's what was meant :-) The way I read it was that
> wake_up_process() would only imply the barrier if the task actually got
> a wakeup (ie. the return value is 1).
> 
> But yes, this makes a lot more sense. Sorry for the confusion.

I will work out a better wording and queue a patch.  I bet that you
are not the only one who got confused.

> > On the wake_up() side, wake_up() calls __wake_up(), which as mentioned
> > earlier calls __wake_up_common() under a lock.  This invokes the
> > wake-up function stored by the sleeping task, for example,
> > autoremove_wake_function(), which calls default_wake_function(),
> > which invokes try_to_wake_up(), which does smp_mb__before_spinlock()
> > before acquiring the to-be-waked task's PI lock.
> > 
> > The definition of smp_mb__before_spinlock() is smp_wmb().  There is
> > also an smp_rmb() in try_to_wake_up(), which still does not get us
> > to a full memory barrier.  It also calls select_task_rq(), which
> > does not seem to guarantee any particular memory ordering (but
> > I could easily have missed something).  It also calls ttwu_queue(),
> > which invokes ttwu_do_activate() under the RQ lock.  I don't see a
> > full memory barrier in ttwu_do_activate(), but again could easily
> > have missed one.  Ditto for ttwu_stat().
> 
> Ah, yes, so I'll defer to Oleg and Linus to explain that one. As per the
> name: smp_mb__before_spinlock() should of course imply a full barrier.

How about if I queue a name change to smp_wmb__before_spinlock()?

							Thanx, Paul



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13  9:45 ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Mel Gorman
  2014-05-13 12:53   ` Mel Gorman
@ 2014-05-13 16:52   ` Peter Zijlstra
  2014-05-14  7:31     ` Mel Gorman
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-13 16:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c60ed0f..d81ed7d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -241,15 +241,15 @@ void delete_from_page_cache(struct page *page)
>  }
>  EXPORT_SYMBOL(delete_from_page_cache);
>  
> -static int sleep_on_page(void *word)
> +static int sleep_on_page(void)
>  {
> -	io_schedule();
> +	io_schedule_timeout(HZ);
>  	return 0;
>  }
>  
> -static int sleep_on_page_killable(void *word)
> +static int sleep_on_page_killable(void)
>  {
> -	sleep_on_page(word);
> +	sleep_on_page();
>  	return fatal_signal_pending(current) ? -EINTR : 0;
>  }
>  

I've got a patch from NeilBrown that conflicts with this, shouldn't be
hard to resolve though.

> @@ -680,30 +680,105 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
>  	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
>  }
>  
> -static inline void wake_up_page(struct page *page, int bit)
> +static inline wait_queue_head_t *clear_page_waiters(struct page *page)
>  {
> -	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
> +	wait_queue_head_t *wqh = NULL;
> +
> +	if (!PageWaiters(page))
> +		return NULL;
> +
> +	/*
> +	 * Prepare to clear PG_waiters if the waitqueue is no longer
> +	 * active. Note that there is no guarantee that a page with no
> +	 * waiters will get cleared as there may be unrelated pages
> +	 * sleeping on the same page wait queue. Accurate detection
> +	 * would require a counter. In the event of a collision, the
> +	 * waiter bit will dangle and lookups will be required until
> +	 * the page is unlocked without collisions. The bit will need to
> +	 * be cleared before freeing to avoid triggering debug checks.
> +	 *
> +	 * Furthermore, this can race with processes about to sleep on
> +	 * the same page if it adds itself to the waitqueue just after
> +	 * this check. The timeout in sleep_on_page prevents the race
> +	 * being a terminal one. In effect, the uncontended and non-race
> +	 * cases are faster in exchange for occasional worst case of the
> +	 * timeout saving us.
> +	 */
> +	wqh = page_waitqueue(page);
> +	if (!waitqueue_active(wqh))
> +		ClearPageWaiters(page);
> +
> +	return wqh;
> +}

This of course is properly disgusting, but my brain isn't working right
on 4 hours of sleep, so I'm unable to suggest anything else.


* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 15:27       ` Paul E. McKenney
  2014-05-13 15:44         ` Peter Zijlstra
@ 2014-05-13 18:18         ` Oleg Nesterov
  2014-05-13 18:24           ` Peter Zijlstra
  2014-05-13 18:52           ` Paul E. McKenney
  1 sibling, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-13 18:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On 05/13, Paul E. McKenney wrote:
>
> On Tue, May 13, 2014 at 04:17:48PM +0200, Peter Zijlstra wrote:
> >
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index 46412bded104..dae5158c2382 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
> >  	event_indicated = 1;
> >  	wake_up_process(event_daemon);
> >
> > -A write memory barrier is implied by wake_up() and co. if and only if they wake
> > -something up.  The barrier occurs before the task state is cleared, and so sits
> > -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> > +A full memory barrier is implied by wake_up() and co. The barrier occurs
>
> Last I checked, the memory barrier was guaranteed

I have to admit, I am confused. I simply do not understand what "memory
barrier" actually means in this discussion.

To me, wake_up/ttwu should only guarantee one thing: all the preceding
STORE's should be serialized with all the subsequent manipulations with
task->state (even with LOAD(task->state)).

> If there is a sleep-wakeup race, for example,
> between wait_event_interruptible() and wake_up(), then it looks to me
> that the following can happen:
>
> o	Task A invokes wait_event_interruptible(), waiting for
> 	X==1.
>
> o	Before Task A gets anywhere, Task B sets Y=1, does
> 	smp_mb(), then sets X=1.
>
> o	Task B invokes wake_up(), which invokes __wake_up(), which
> 	acquires the wait_queue_head_t's lock and invokes
> 	__wake_up_common(), which sees nothing to wake up.
>
> o	Task A tests the condition, finds X==1, and returns without
> 	locks, memory barriers, atomic instructions, or anything else
> 	that would guarantee ordering.
>
> o	Task A then loads from Y.  Because there have been no memory
> 	barriers, it might well see Y==0.

Sure, but I can't understand "Because there have been no memory barriers".

IOW. Suppose we add mb() into wake_up(). The same can happen anyway?

And "if a wakeup actually occurred" is not clear to me too in this context.
For example, suppose that ttwu() clears task->state but that task was not
deactivated and it is going to check the condition, do we count this as
"wakeup actually occurred" ? In this case that task still can see Y==0.


> On the other hand, if a wake_up() really does happen, then
> the fast-path out of wait_event_interruptible() is not taken,
> and __wait_event_interruptible() is called instead.  This calls
> ___wait_event(), which eventually calls prepare_to_wait_event(), which
> in turn calls set_current_state(), which calls set_mb(), which does a
> full memory barrier.

Can't understand this part too... OK, and suppose that right after that
the task B from the scenario above does

	Y = 1;
	mb();
	X = 1;
	wake_up();

After that task A checks the condition, sees X==1, and returns from
wait_event() without spin_lock(wait_queue_head_t->lock) (if it also
sees list_empty_careful() == T). Then it can see Y==0 again?

> 	A read and a write memory barrier (-not- a full memory barrier)
> 	are implied by wake_up() and co. if and only if they wake
> 	something up.

Now this looks as if you document that, say,

	X = 1;
	wake_up();
	Y = 1;

doesn't need wmb() before "Y = 1" if wake_up() wakes something up. Do we
really want to document this? Is it fine to rely on this guarantee?

> The write barrier occurs before the task state is
> 	cleared, and so sits between the STORE to indicate the event and
> 	the STORE to set TASK_RUNNING, and the read barrier after that:

Plus: between the STORE to indicate the event and the LOAD which checks
task->state, otherwise:

> 	CPU 1				CPU 2
> 	===============================	===============================
> 	set_current_state();		STORE event_indicated
> 	  set_mb();			wake_up();
> 	    STORE current->state	  <write barrier>
> 	    <general barrier>		  STORE current->state
> 	LOAD event_indicated		  <read barrier>

this code is still racy.

In short: I am totally confused and most probably misunderstood you ;)

Oleg.



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 15:44         ` Peter Zijlstra
  2014-05-13 16:14           ` Paul E. McKenney
@ 2014-05-13 18:22           ` Oleg Nesterov
  1 sibling, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-13 18:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On 05/13, Peter Zijlstra wrote:
>
> Ah, yes, so I'll defer to Oleg and Linus to explain that one. As per the
> name: smp_mb__before_spinlock() should of course imply a full barrier.

Oh yes, I agree, the name is confusing. At least the comment tries to
explain what it does.

Oleg.



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 18:18         ` Oleg Nesterov
@ 2014-05-13 18:24           ` Peter Zijlstra
  2014-05-13 18:52           ` Paul E. McKenney
  1 sibling, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-13 18:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Paul E. McKenney, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On Tue, May 13, 2014 at 08:18:52PM +0200, Oleg Nesterov wrote:
> 
> In short: I am totally confused and most probably misunderstood you ;)

Yeah, my bad, I got myself totally confused and it seems to spread fast.


* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 18:18         ` Oleg Nesterov
  2014-05-13 18:24           ` Peter Zijlstra
@ 2014-05-13 18:52           ` Paul E. McKenney
  2014-05-13 19:31             ` Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2014-05-13 18:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On Tue, May 13, 2014 at 08:18:52PM +0200, Oleg Nesterov wrote:
> On 05/13, Paul E. McKenney wrote:
> >
> > On Tue, May 13, 2014 at 04:17:48PM +0200, Peter Zijlstra wrote:
> > >
> > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > index 46412bded104..dae5158c2382 100644
> > > --- a/Documentation/memory-barriers.txt
> > > +++ b/Documentation/memory-barriers.txt
> > > @@ -1881,9 +1881,9 @@ The whole sequence above is available in various canned forms, all of which
> > >  	event_indicated = 1;
> > >  	wake_up_process(event_daemon);
> > >
> > > -A write memory barrier is implied by wake_up() and co. if and only if they wake
> > > -something up.  The barrier occurs before the task state is cleared, and so sits
> > > -between the STORE to indicate the event and the STORE to set TASK_RUNNING:
> > > +A full memory barrier is implied by wake_up() and co. The barrier occurs
> >
> > Last I checked, the memory barrier was guaranteed
> 
> I have to admit, I am confused. I simply do not understand what "memory
> barrier" actually means in this discussion.
> 
> To me, wake_up/ttwu should only guarantee one thing: all the preceding
> STORE's should be serialized with all the subsequent manipulations with
> task->state (even with LOAD(task->state)).

I was thinking in terms of "everything done before the wake_up() is
visible after the wait_event*() returns" -- but only if the task doing
the wait_event*() actually sleeps and is awakened by that particular
wake_up().

Admittedly a bit of a weak guarantee!

> > If there is a sleep-wakeup race, for example,
> > between wait_event_interruptible() and wake_up(), then it looks to me
> > that the following can happen:
> >
> > o	Task A invokes wait_event_interruptible(), waiting for
> > 	X==1.
> >
> > o	Before Task A gets anywhere, Task B sets Y=1, does
> > 	smp_mb(), then sets X=1.
> >
> > o	Task B invokes wake_up(), which invokes __wake_up(), which
> > 	acquires the wait_queue_head_t's lock and invokes
> > 	__wake_up_common(), which sees nothing to wake up.
> >
> > o	Task A tests the condition, finds X==1, and returns without
> > 	locks, memory barriers, atomic instructions, or anything else
> > 	that would guarantee ordering.
> >
> > o	Task A then loads from Y.  Because there have been no memory
> > 	barriers, it might well see Y==0.
> 
> Sure, but I can't understand "Because there have been no memory barriers".
> 
> IOW. Suppose we add mb() into wake_up(). The same can happen anyway?

If the mb() is placed just after the fastpath condition check, then the
awakened task will be guaranteed to see Y=1.  Either that memory barrier
or the wait_queue_head_t's lock will guarantee the serialization, I think,
anyway.

> And "if a wakeup actually occurred" is not clear to me too in this context.
> For example, suppose that ttwu() clears task->state but that task was not
> deactivated and it is going to check the condition, do we count this as
> "wakeup actually occurred" ? In this case that task still can see Y==0.

I was thinking in terms of the task doing the wait_event*() actually
entering the scheduler.

> > On the other hand, if a wake_up() really does happen, then
> > the fast-path out of wait_event_interruptible() is not taken,
> > and __wait_event_interruptible() is called instead.  This calls
> > ___wait_event(), which eventually calls prepare_to_wait_event(), which
> > in turn calls set_current_state(), which calls set_mb(), which does a
> > full memory barrier.
> 
> Can't understand this part too... OK, and suppose that right after that
> the task B from the scenario above does
> 
> 	Y = 1;
> 	mb();
> 	X = 1;
> 	wake_up();
> 
> After that task A checks the condition, sees X==1, and returns from
> wait_event() without spin_lock(wait_queue_head_t->lock) (if it also
> sees list_empty_careful() == T). Then it can see Y==0 again?

Yes.  You need the barriers to be paired, and in this case, Task A isn't
executing a memory barrier.  Yes, the mb() has forced Task B's CPU to
commit the writes in order (or at least pretend to), but Task A might
have speculated the read to Y.

Or am I missing your point?

> > 	A read and a write memory barrier (-not- a full memory barrier)
> > 	are implied by wake_up() and co. if and only if they wake
> > 	something up.
> 
> Now this looks as if you document that, say,
> 
> 	X = 1;
> 	wake_up();
> 	Y = 1;
> 
> doesn't need wmb() before "Y = 1" if wake_up() wakes something up. Do we
> really want to document this? Is it fine to rely on this guarantee?

That is an excellent question.  It would not be hard to argue that we
should either make the guarantee unconditional by adding smp_mb() to
the wait_event*() paths or alternatively just saying that there isn't
a guarantee to begin with.
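
For the first option, the change would be something along these lines
(hand-waved against my memory of the current wait_event() macro, so a
sketch rather than a patch):

#define wait_event(wq, condition)					\
do {									\
	if (condition) {						\
		smp_mb();	/* pair with the waker's barrier */	\
		break;							\
	}								\
	__wait_event(wq, condition);					\
} while (0)

...and similarly for every other wait_event_*() fast path.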

Thoughts?

> > The write barrier occurs before the task state is
> > 	cleared, and so sits between the STORE to indicate the event and
> > 	the STORE to set TASK_RUNNING, and the read barrier after that:
> 
> Plus: between the STORE to indicate the event and the LOAD which checks
> task->state, otherwise:
> 
> > 	CPU 1				CPU 2
> > 	===============================	===============================
> > 	set_current_state();		STORE event_indicated
> > 	  set_mb();			wake_up();
> > 	    STORE current->state	  <write barrier>
> > 	    <general barrier>		  STORE current->state
> > 	LOAD event_indicated		  <read barrier>
> 
> this code is still racy.

Yeah, it is missing some key components.  That said, we should figure
out exactly what we want to guarantee before I try to fix it.  ;-)

> In short: I am totally confused and most probably misunderstood you ;)

Oleg, if it confuses you, it is in desperate need of help!  ;-)

							Thanx, Paul



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 16:14           ` Paul E. McKenney
@ 2014-05-13 18:57             ` Oleg Nesterov
  2014-05-13 20:24               ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-13 18:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On 05/13, Paul E. McKenney wrote:
>
> On Tue, May 13, 2014 at 05:44:35PM +0200, Peter Zijlstra wrote:
> >
> > Ah, yes, so I'll defer to Oleg and Linus to explain that one. As per the
> > name: smp_mb__before_spinlock() should of course imply a full barrier.
>
> How about if I queue a name change to smp_wmb__before_spinlock()?

I agree, this is more accurate, simply because it describes what it
actually does.

But just in case, as for try_to_wake_up() it does not actually need
wmb() between "CONDITION = T" and "task->state = RUNNING". It would
be fine if these 2 STORE's are re-ordered; we can rely on rq->lock.

What it actually needs is a barrier between "CONDITION = T" and
"task->state & state" check. But since we do not have a store-load
barrier, wmb() was added to ensure that "CONDITION = T" can't leak
into the critical section.
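
That is, the code in question is the head of try_to_wake_up(), roughly
(abridged from memory):

	smp_mb__before_spinlock();
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	if (!(p->state & state))	/* needs a barrier against the preceding CONDITION = T */
		goto out;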

But it seems that set_tlb_flush_pending() already assumes that it
acts as wmb(), so probably smp_wmb__before_spinlock() is fine.

Oleg.



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 18:52           ` Paul E. McKenney
@ 2014-05-13 19:31             ` Oleg Nesterov
  2014-05-13 20:32               ` Paul E. McKenney
  0 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-13 19:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On 05/13, Paul E. McKenney wrote:
>
> On Tue, May 13, 2014 at 08:18:52PM +0200, Oleg Nesterov wrote:
> >
> > I have to admit, I am confused. I simply do not understand what "memory
> > barrier" actually means in this discussion.
> >
> > To me, wake_up/ttwu should only guarantee one thing: all the preceding
> > STORE's should be serialized with all the subsequent manipulations with
> > task->state (even with LOAD(task->state)).
>
> I was thinking in terms of "everything done before the wake_up() is
> visible after the wait_event*() returns" -- but only if the task doing
> the wait_event*() actually sleeps and is awakened by that particular
> wake_up().

Hmm. The question is, visible to whom ;) To the woken task?

Yes sure, and this is simply because both sleeper/waker take rq->lock.

> > > If there is a sleep-wakeup race, for example,
> > > between wait_event_interruptible() and wake_up(), then it looks to me
> > > that the following can happen:
> > >
> > > o	Task A invokes wait_event_interruptible(), waiting for
> > > 	X==1.
> > >
> > > o	Before Task A gets anywhere, Task B sets Y=1, does
> > > 	smp_mb(), then sets X=1.
> > >
> > > o	Task B invokes wake_up(), which invokes __wake_up(), which
> > > 	acquires the wait_queue_head_t's lock and invokes
> > > 	__wake_up_common(), which sees nothing to wake up.
> > >
> > > o	Task A tests the condition, finds X==1, and returns without
> > > 	locks, memory barriers, atomic instructions, or anything else
> > > 	that would guarantee ordering.
> > >
> > > o	Task A then loads from Y.  Because there have been no memory
> > > 	barriers, it might well see Y==0.
> >
> > Sure, but I can't understand "Because there have been no memory barriers".
> >
> > IOW. Suppose we add mb() into wake_up(). The same can happen anyway?
>
> If the mb() is placed just after the fastpath condition check, then the
> awakened task will be guaranteed to see Y=1.

Of course. My point was, this has nothing to do with the barriers provided
by wake_up(), that is why I was confused.

> > > On the other hand, if a wake_up() really does happen, then
> > > the fast-path out of wait_event_interruptible() is not taken,
> > > and __wait_event_interruptible() is called instead.  This calls
> > > ___wait_event(), which eventually calls prepare_to_wait_event(), which
> > > in turn calls set_current_state(), which calls set_mb(), which does a
> > > full memory barrier.
> >
> > Can't understand this part too... OK, and suppose that right after that
> > the task B from the scenario above does
> >
> > 	Y = 1;
> > 	mb();
> > 	X = 1;
> > 	wake_up();
> >
> > After that task A checks the condition, sees X==1, and returns from
> > wait_event() without spin_lock(wait_queue_head_t->lock) (if it also
> > sees list_empty_careful() == T). Then it can see Y==0 again?
>
> Yes.  You need the barriers to be paired, and in this case, Task A isn't
> executing a memory barrier.  Yes, the mb() has forced Task B's CPU to
> commit the writes in order (or at least pretend to), but Task A might
> have speculated the read to Y.
>
> Or am I missing your point?

I only meant that this case doesn't really differ from the scenario you
described above.

> > > 	A read and a write memory barrier (-not- a full memory barrier)
> > > 	are implied by wake_up() and co. if and only if they wake
> > > 	something up.
> >
> > Now this looks as if you document that, say,
> >
> > 	X = 1;
> > 	wake_up();
> > 	Y = 1;
> >
> > doesn't need wmb() before "Y = 1" if wake_up() wakes something up. Do we
> > really want to document this? Is it fine to rely on this guarantee?
>
> That is an excellent question.  It would not be hard to argue that we
> should either make the guarantee unconditional by adding smp_mb() to
> the wait_event*() paths or alternatively just saying that there isn't
> a guarantee to begin with.

I'd vote for "no guarantees".

> > In short: I am totally confused and most probably misunderstood you ;)
>
> Oleg, if it confuses you, it is in desperate need of help!  ;-)

Thanks, this helped ;)

Oleg.



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 18:57             ` Oleg Nesterov
@ 2014-05-13 20:24               ` Paul E. McKenney
  2014-05-14 14:25                 ` Oleg Nesterov
  0 siblings, 1 reply; 103+ messages in thread
From: Paul E. McKenney @ 2014-05-13 20:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On Tue, May 13, 2014 at 08:57:42PM +0200, Oleg Nesterov wrote:
> On 05/13, Paul E. McKenney wrote:
> >
> > On Tue, May 13, 2014 at 05:44:35PM +0200, Peter Zijlstra wrote:
> > >
> > > Ah, yes, so I'll defer to Oleg and Linus to explain that one. As per the
> > > name: smp_mb__before_spinlock() should of course imply a full barrier.
> >
> > How about if I queue a name change to smp_wmb__before_spinlock()?
> 
> I agree, this is more accurate, simply because it describes what it
> actually does.
> 
> But just in case, as for try_to_wake_up() it does not actually need
> wmb() between "CONDITION = T" and "task->state = RUNNING". It would
> be fine if these 2 STORE's are re-ordered, we can rely on rq->lock.
> 
> What it actually needs is a barrier between "CONDITION = T" and
> "task->state & state" check. But since we do not have a store-load
> barrier, wmb() was added to ensure that "CONDITION = T" can't leak
> into the critical section.
> 
> But it seems that set_tlb_flush_pending() already assumes that it
> acts as wmb(), so probably smp_wmb__before_spinlock() is fine.

Except that when I go to make the change, I find the following in
the documentation:

     Memory operations issued before the ACQUIRE may be completed after
     the ACQUIRE operation has completed.  An smp_mb__before_spinlock(),
     combined with a following ACQUIRE, orders prior loads against
     subsequent loads and stores and also orders prior stores against
     subsequent stores.  Note that this is weaker than smp_mb()!  The
     smp_mb__before_spinlock() primitive is free on many architectures.

Which means that either the documentation is wrong or the implementation
is.  Yes, smp_wmb() has the semantics called out above on many platforms,
but not on Alpha or ARM.
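
(For reference, the generic fallback, in include/linux/spinlock.h if I am
looking at the right spot, is just:

#ifndef smp_mb__before_spinlock
#define smp_mb__before_spinlock()	smp_wmb()
#endif

so architectures that don't override it get smp_wmb() and nothing more.)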

So, as you say, set_tlb_flush_pending() only relies on smp_wmb().
The comment in try_to_wake_up() seems to be assuming a full memory
barrier.  The comment in __schedule() also seems to be relying on
a full memory barrier (prior write against subsequent read).  Yow!

So maybe barrier() on TSO systems like x86 and mainframe and stronger
barriers on other systems, depending on what their lock acquisition
looks like?

Or am I misinterpreting try_to_wake_up() and __schedule()?

							Thanx, Paul



* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 19:31             ` Oleg Nesterov
@ 2014-05-13 20:32               ` Paul E. McKenney
  0 siblings, 0 replies; 103+ messages in thread
From: Paul E. McKenney @ 2014-05-13 20:32 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On Tue, May 13, 2014 at 09:31:46PM +0200, Oleg Nesterov wrote:
> On 05/13, Paul E. McKenney wrote:
> >
> > On Tue, May 13, 2014 at 08:18:52PM +0200, Oleg Nesterov wrote:
> > >
> > > I have to admit, I am confused. I simply do not understand what "memory
> > > barrier" actually means in this discussion.
> > >
> > > To me, wake_up/ttwu should only guarantee one thing: all the preceding
> > > STORE's should be serialized with all the subsequent manipulations with
> > > task->state (even with LOAD(task->state)).
> >
> > I was thinking in terms of "everything done before the wake_up() is
> > visible after the wait_event*() returns" -- but only if the task doing
> > the wait_event*() actually sleeps and is awakened by that particular
> > wake_up().
> 
> Hmm. The question is, visible to whom ;) To the woken task?
> 
> Yes sure, and this is simply because both sleeper/waker take rq->lock.

Yep, that was the thought.

> > > > If there is a sleep-wakeup race, for example,
> > > > between wait_event_interruptible() and wake_up(), then it looks to me
> > > > that the following can happen:
> > > >
> > > > o	Task A invokes wait_event_interruptible(), waiting for
> > > > 	X==1.
> > > >
> > > > o	Before Task A gets anywhere, Task B sets Y=1, does
> > > > 	smp_mb(), then sets X=1.
> > > >
> > > > o	Task B invokes wake_up(), which invokes __wake_up(), which
> > > > 	acquires the wait_queue_head_t's lock and invokes
> > > > 	__wake_up_common(), which sees nothing to wake up.
> > > >
> > > > o	Task A tests the condition, finds X==1, and returns without
> > > > 	locks, memory barriers, atomic instructions, or anything else
> > > > 	that would guarantee ordering.
> > > >
> > > > o	Task A then loads from Y.  Because there have been no memory
> > > > 	barriers, it might well see Y==0.
> > >
> > > Sure, but I can't understand "Because there have been no memory barriers".
> > >
> > > IOW. Suppose we add mb() into wake_up(). The same can happen anyway?
> >
> > If the mb() is placed just after the fastpath condition check, then the
> > awakened task will be guaranteed to see Y=1.
> 
> Of course. My point was, this has nothing to do with the barriers provided
> by wake_up(), that is why I was confused.
> 
> > > > On the other hand, if a wake_up() really does happen, then
> > > > the fast-path out of wait_event_interruptible() is not taken,
> > > > and __wait_event_interruptible() is called instead.  This calls
> > > > ___wait_event(), which eventually calls prepare_to_wait_event(), which
> > > > in turn calls set_current_state(), which calls set_mb(), which does a
> > > > full memory barrier.
> > >
> > > Can't understand this part too... OK, and suppose that right after that
> > > the task B from the scenario above does
> > >
> > > 	Y = 1;
> > > 	mb();
> > > 	X = 1;
> > > 	wake_up();
> > >
> > > After that task A checks the condition, sees X==1, and returns from
> > > wait_event() without spin_lock(wait_queue_head_t->lock) (if it also
> > > sees list_empty_careful() == T). Then it can see Y==0 again?
> >
> > Yes.  You need the barriers to be paired, and in this case, Task A isn't
> > executing a memory barrier.  Yes, the mb() has forced Task B's CPU to
> > commit the writes in order (or at least pretend to), but Task A might
> > have speculated the read to Y.
> >
> > Or am I missing your point?
> 
> I only meant that this case doesn't really differ from the scenario you
> described above.

Indeed, I was taking a bit of an exploratory approach to this.

> > > > 	A read and a write memory barrier (-not- a full memory barrier)
> > > > 	are implied by wake_up() and co. if and only if they wake
> > > > 	something up.
> > >
> > > Now this looks as if you document that, say,
> > >
> > > 	X = 1;
> > > 	wake_up();
> > > 	Y = 1;
> > >
> > > doesn't need wmb() before "Y = 1" if wake_up() wakes something up. Do we
> > > really want to document this? Is it fine to rely on this guarantee?
> >
> > That is an excellent question.  It would not be hard to argue that we
> > should either make the guarantee unconditional by adding smp_mb() to
> > the wait_event*() paths or alternatively just saying that there isn't
> > a guarantee to begin with.
> 
> I'd vote for "no guarantees".

I would have no objections to that.  Other than the large number of those
things in the kernel!

The thing is that I am having a hard time imagining how you guarantee that
a wakeup actually happened.  I am betting that there are a lot of bugs
related to this weak guarantee...

> > > In short: I am totally confused and most probably misunderstood you ;)
> >
> > Oleg, if it confuses you, it is in desperate need of help!  ;-)
> 
> Thanks, this helped ;)

Glad to help!  ;-)

							Thanx, Paul



* Re: [PATCH 05/19] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-05-13  9:45 ` [PATCH 05/19] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Mel Gorman
@ 2014-05-13 22:25   ` Andrew Morton
  2014-05-14  6:32     ` Mel Gorman
  2014-05-14 20:29     ` Mel Gorman
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2014-05-13 22:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, 13 May 2014 10:45:36 +0100 Mel Gorman <mgorman@suse.de> wrote:

> There is no need to calculate zone_idx(preferred_zone) multiple times
> or use the pgdat to figure it out.
> 

This one falls afoul of pending mm/next changes in non-trivial ways.


* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13  9:45 ` [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers Mel Gorman
  2014-05-13 11:09   ` Peter Zijlstra
  2014-05-13 13:50   ` Jan Kara
@ 2014-05-13 22:29   ` Andrew Morton
  2014-05-14  6:12     ` Mel Gorman
  2 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2014-05-13 22:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, 13 May 2014 10:45:48 +0100 Mel Gorman <mgorman@suse.de> wrote:

> Discarding buffers uses a bunch of atomic operations when discarding buffers
> because ...... I can't think of a reason. Use a cmpxchg loop to clear all the
> necessary flags. In most (all?) cases this will be a single atomic operations.
> 
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1485,14 +1485,18 @@ EXPORT_SYMBOL(set_bh_page);
>   */
>  static void discard_buffer(struct buffer_head * bh)
>  {
> +	unsigned long b_state, b_state_old;
> +
>  	lock_buffer(bh);
>  	clear_buffer_dirty(bh);
>  	bh->b_bdev = NULL;
> -	clear_buffer_mapped(bh);
> -	clear_buffer_req(bh);
> -	clear_buffer_new(bh);
> -	clear_buffer_delay(bh);
> -	clear_buffer_unwritten(bh);
> +	b_state = bh->b_state;
> +	for (;;) {
> +		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> +		if (b_state_old == b_state)
> +			break;
> +		b_state = b_state_old;
> +	}
>  	unlock_buffer(bh);
>  }
>  
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -77,6 +77,11 @@ struct buffer_head {
>  	atomic_t b_count;		/* users using this buffer_head */
>  };
>  
> +/* Bits that are cleared during an invalidate */
> +#define BUFFER_FLAGS_DISCARD \
> +	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
> +	 1 << BH_Delay | 1 << BH_Unwritten)
> +

There isn't much point in having this in the header file is there?

--- a/fs/buffer.c~fs-buffer-do-not-use-unnecessary-atomic-operations-when-discarding-buffers-fix
+++ a/fs/buffer.c
@@ -1483,6 +1483,12 @@ EXPORT_SYMBOL(set_bh_page);
 /*
  * Called when truncating a buffer on a page completely.
  */
+
+/* Bits that are cleared during an invalidate */
+#define BUFFER_FLAGS_DISCARD \
+	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
+	 1 << BH_Delay | 1 << BH_Unwritten)
+
 static void discard_buffer(struct buffer_head * bh)
 {
 	unsigned long b_state, b_state_old;
@@ -1492,7 +1498,8 @@ static void discard_buffer(struct buffer
 	bh->b_bdev = NULL;
 	b_state = bh->b_state;
 	for (;;) {
-		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
+		b_state_old = cmpxchg(&bh->b_state, b_state,
+				      (b_state & ~BUFFER_FLAGS_DISCARD));
 		if (b_state_old == b_state)
 			break;
 		b_state = b_state_old;
--- a/include/linux/buffer_head.h~fs-buffer-do-not-use-unnecessary-atomic-operations-when-discarding-buffers-fix
+++ a/include/linux/buffer_head.h
@@ -77,11 +77,6 @@ struct buffer_head {
 	atomic_t b_count;		/* users using this buffer_head */
 };
 
-/* Bits that are cleared during an invalidate */
-#define BUFFER_FLAGS_DISCARD \
-	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
-	 1 << BH_Delay | 1 << BH_Unwritten)
-
 /*
  * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
  * and buffer_foo() functions.
_


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers
  2014-05-13 22:29   ` Andrew Morton
@ 2014-05-14  6:12     ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-14  6:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, May 13, 2014 at 03:29:00PM -0700, Andrew Morton wrote:
> On Tue, 13 May 2014 10:45:48 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > Discarding buffers uses a bunch of atomic operations when discarding buffers
> > because ...... I can't think of a reason. Use a cmpxchg loop to clear all the
> > necessary flags. In most (all?) cases this will be a single atomic operation.
> > 
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1485,14 +1485,18 @@ EXPORT_SYMBOL(set_bh_page);
> >   */
> >  static void discard_buffer(struct buffer_head * bh)
> >  {
> > +	unsigned long b_state, b_state_old;
> > +
> >  	lock_buffer(bh);
> >  	clear_buffer_dirty(bh);
> >  	bh->b_bdev = NULL;
> > -	clear_buffer_mapped(bh);
> > -	clear_buffer_req(bh);
> > -	clear_buffer_new(bh);
> > -	clear_buffer_delay(bh);
> > -	clear_buffer_unwritten(bh);
> > +	b_state = bh->b_state;
> > +	for (;;) {
> > +		b_state_old = cmpxchg(&bh->b_state, b_state, (b_state & ~BUFFER_FLAGS_DISCARD));
> > +		if (b_state_old == b_state)
> > +			break;
> > +		b_state = b_state_old;
> > +	}
> >  	unlock_buffer(bh);
> >  }
> >  
> > --- a/include/linux/buffer_head.h
> > +++ b/include/linux/buffer_head.h
> > @@ -77,6 +77,11 @@ struct buffer_head {
> >  	atomic_t b_count;		/* users using this buffer_head */
> >  };
> >  
> > +/* Bits that are cleared during an invalidate */
> > +#define BUFFER_FLAGS_DISCARD \
> > +	(1 << BH_Mapped | 1 << BH_New | 1 << BH_Req | \
> > +	 1 << BH_Delay | 1 << BH_Unwritten)
> > +
> 
> There isn't much point in having this in the header file is there?
> 

No, it's not necessary. I was just keeping it with the definition of the
flags. Your fix on top looks fine.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 05/19] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-05-13 22:25   ` Andrew Morton
@ 2014-05-14  6:32     ` Mel Gorman
  2014-05-14 20:29     ` Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-14  6:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, May 13, 2014 at 03:25:56PM -0700, Andrew Morton wrote:
> On Tue, 13 May 2014 10:45:36 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > There is no need to calculate zone_idx(preferred_zone) multiple times
> > or use the pgdat to figure it out.
> > 
> 
> This one falls afoul of pending mm/next changes in non-trivial ways.

No problem, I can rework this patch on top of mmotm. Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 16:52   ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Peter Zijlstra
@ 2014-05-14  7:31     ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-14  7:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Tue, May 13, 2014 at 06:52:23PM +0200, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c60ed0f..d81ed7d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -241,15 +241,15 @@ void delete_from_page_cache(struct page *page)
> >  }
> >  EXPORT_SYMBOL(delete_from_page_cache);
> >  
> > -static int sleep_on_page(void *word)
> > +static int sleep_on_page(void)
> >  {
> > -	io_schedule();
> > +	io_schedule_timeout(HZ);
> >  	return 0;
> >  }
> >  
> > -static int sleep_on_page_killable(void *word)
> > +static int sleep_on_page_killable(void)
> >  {
> > -	sleep_on_page(word);
> > +	sleep_on_page();
> >  	return fatal_signal_pending(current) ? -EINTR : 0;
> >  }
> >  
> 
> I've got a patch from NeilBrown that conflicts with this, shouldn't be
> hard to resolve though.
> 

Kick me if there are problems.

> > @@ -680,30 +680,105 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
> >  	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
> >  }
> >  
> > -static inline void wake_up_page(struct page *page, int bit)
> > +static inline wait_queue_head_t *clear_page_waiters(struct page *page)
> >  {
> > -	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
> > +	wait_queue_head_t *wqh = NULL;
> > +
> > +	if (!PageWaiters(page))
> > +		return NULL;
> > +
> > +	/*
> > +	 * Prepare to clear PG_waiters if the waitqueue is no longer
> > +	 * active. Note that there is no guarantee that a page with no
> > +	 * waiters will get cleared as there may be unrelated pages
> > +	 * sleeping on the same page wait queue. Accurate detection
> > +	 * would require a counter. In the event of a collision, the
> > +	 * waiter bit will dangle and lookups will be required until
> > +	 * the page is unlocked without collisions. The bit will need to
> > +	 * be cleared before freeing to avoid triggering debug checks.
> > +	 *
> > +	 * Furthermore, this can race with processes about to sleep on
> > +	 * the same page if it adds itself to the waitqueue just after
> > +	 * this check. The timeout in sleep_on_page prevents the race
> > +	 * being a terminal one. In effect, the uncontended and non-race
> > +	 * cases are faster in exchange for occasional worst case of the
> > +	 * timeout saving us.
> > +	 */
> > +	wqh = page_waitqueue(page);
> > +	if (!waitqueue_active(wqh))
> > +		ClearPageWaiters(page);
> > +
> > +	return wqh;
> > +}
> 
> This of course is properly disgusting, but my brain isn't working right
> on 4 hours of sleep, so I'm unable to suggest anything else.

It could be "solved" by adding a zone lock or abusing the mapping tree_lock
to protect the waiters bit but that would put a very expensive operation into
the unlock page path. Same goes for any sort of sequence counter tricks. The
waitqueue lock cannot be used in this case because that would necessitate
looking up page_waitqueue every time which would render the patch useless.

It occurs to me that one option would be to recheck waiters once we're
added to the waitqueue and if PageWaiters is clear then recheck the bit
we're waiting on instead of going to sleep.
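
Roughly, something like this (an untested sketch of that idea, assuming it
lives beside page_waitqueue() in mm/filemap.c and reuses the PageWaiters
helpers from the patch):

	static void wait_on_page_bit_sketch(struct page *page, int bit_nr)
	{
		wait_queue_head_t *wqh = page_waitqueue(page);
		DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

		while (test_bit(bit_nr, &page->flags)) {
			prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE);
			if (!PageWaiters(page)) {
				/*
				 * A racing unlock_page() may have cleared
				 * PG_waiters before we were visible on the
				 * queue; re-arm it and recheck the flag
				 * instead of sleeping.
				 */
				SetPageWaiters(page);
				continue;
			}
			if (test_bit(bit_nr, &page->flags))
				io_schedule();
		}
		finish_wait(wqh, &wait.wait);
	}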

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 20:24               ` Paul E. McKenney
@ 2014-05-14 14:25                 ` Oleg Nesterov
  0 siblings, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-14 14:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Mel Gorman, Andrew Morton, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Linus Torvalds, David Howells

On 05/13, Paul E. McKenney wrote:
>
> On Tue, May 13, 2014 at 08:57:42PM +0200, Oleg Nesterov wrote:
> > On 05/13, Paul E. McKenney wrote:
> > >
> > > On Tue, May 13, 2014 at 05:44:35PM +0200, Peter Zijlstra wrote:
> > > >
> > > > Ah, yes, so I'll defer to Oleg and Linus to explain that one. As per the
> > > > name: smp_mb__before_spinlock() should of course imply a full barrier.
> > >
> > > How about if I queue a name change to smp_wmb__before_spinlock()?
> >
> > I agree, this is more accurate, simply because it describes what it
> > actually does.
> >
> > But just in case, as for try_to_wake_up() it does not actually need
> > wmb() between "CONDITION = T" and "task->state = RUNNING". It would
> > be fine if these 2 STORE's are re-ordered, we can rely on rq->lock.
> >
> > What it actually needs is a barrier between "CONDITION = T" and
> > "task->state & state" check. But since we do not have a store-load
> > barrier, wmb() was added to ensure that "CONDITION = T" can't leak
> > into the critical section.
> >
> > But it seems that set_tlb_flush_pending() already assumes that it
> > acts as wmb(), so probably smp_wmb__before_spinlock() is fine.
>
> Except that when I go to make the change, I find the following in
> the documentation:
>
>      Memory operations issued before the ACQUIRE may be completed after
>      the ACQUIRE operation has completed.  An smp_mb__before_spinlock(),
>      combined with a following ACQUIRE, orders prior loads against
>      subsequent loads and stores and also orders prior stores against
>      subsequent stores.  Note that this is weaker than smp_mb()!  The
>      smp_mb__before_spinlock() primitive is free on many architectures.
>
> Which means that either the documentation is wrong or the implementation
> is.  Yes, smp_wmb() has the semantics called out above on many platforms,
> but not on Alpha or ARM.

Well, I think the documentation is wrong in any case. "prior loads
against subsequent loads" is not true. And it doesn't document that
the initial goal was "prior stores against the subsequent loads".
"prior stores against the subsequent stores" is obviously true for
the default implementation, but this is the "side effect" because
it uses wmb().


The only intent of wmb() added by 04e2f174 "Add memory barrier semantics
to wake_up() & co" (afaics at least) was: make sure that ttwu() does not
read p->state before the preceding stores are completed.

e0acd0a68e "sched: fix the theoretical signal_wake_up() vs schedule()
race" added the new helper for documentation, to explain that the
default implementation abuses wmb() to achieve the serialization above.

> So, as you say, set_tlb_flush_pending() only relies on smp_wmb().

The comment says ;) and this means that even if we suddenly have a new
load_store() barrier (which could work for ttwu/schedule) we can no
longer change smp_mb__before_spinlock() to use it.

> The comment in try_to_wake_up() seems to be assuming a full memory
> barrier.  The comment in __schedule() also seems to be relying on
> a full memory barrier (prior write against subsequent read).  Yow!

Well yes, but see above. Again, we need load_store() before reading
p->state, which we do not have. wmb() before spin_lock() can be used
instead.

But, try_to_wake_up() and __schedule() do not need a full barrier in
a sense that if we are going to wake this task up (or just clear its
->state), then "CONDITION = T" can be delayed till spin_unlock().

We do not care if that tasks misses CONDITION in this case, it will
call schedule() which will take the same lock. But if we are not going
to wake it up, we need to ensure that the task can't miss CONDITION.

IOW, this all is simply about

	CONDITION = T;			current->state = TASK_XXX;
					mb();

	if (p->state)			if (!CONDITION)
		wake_it_up();			schedule();

race.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-13 14:17     ` Peter Zijlstra
  2014-05-13 15:27       ` Paul E. McKenney
@ 2014-05-14 16:11       ` Oleg Nesterov
  2014-05-14 16:17         ` Peter Zijlstra
  2014-05-14 19:29         ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Oleg Nesterov
  1 sibling, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-14 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

The subsequent discussion was "off-topic", and it seems that the patch
itself needs a bit more discussion,

On 05/13, Peter Zijlstra wrote:
>
> On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> > On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > >  void unlock_page(struct page *page)
> > >  {
> > > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > > +
> > >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > +
> > > +	/*
> > > +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> > > +	 * before waking waiters
> > > +	 */
> > >  	clear_bit_unlock(PG_locked, &page->flags);
> > > -	smp_mb__after_clear_bit();
> > > -	wake_up_page(page, PG_locked);
> >
> > This is wrong.

Yes,

> > The smp_mb__after_clear_bit() is still required to ensure
> > that the cleared bit is visible before the wakeup on all architectures.

But note that "the cleared bit is visible before the wakeup" is confusing.
I mean, we do not need mb() before __wake_up(). We need it only because
__wake_up_bit() checks waitqueue_active().


And at least

	fs/cachefiles/namei.c:cachefiles_delete_object()
	fs/block_dev.c:blkdev_get()
	kernel/signal.c:task_clear_jobctl_trapping()
	security/keys/gc.c:key_garbage_collector()

look obviously wrong.
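
The broken and the fixed pattern, sketched with a hypothetical flag word
rather than code copied from those files:

	static unsigned long my_flags;	/* hypothetical flag word */

	/*
	 * Broken: the clear_bit() store can be reordered with the
	 * waitqueue_active() load inside __wake_up_bit(), so the waker may
	 * see an empty queue and skip the wakeup while the sleeper still
	 * sees the bit set and stays asleep.
	 */
	clear_bit(0, &my_flags);
	wake_up_bit(&my_flags, 0);

	/* Required: a full barrier between clearing the bit and the wakeup. */
	clear_bit(0, &my_flags);
	smp_mb__after_atomic();
	wake_up_bit(&my_flags, 0);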

I would be happy to send the fix, but do I need to split it per-file?
Given that it is trivial, perhaps I can send a single patch?

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-14 16:11       ` Oleg Nesterov
@ 2014-05-14 16:17         ` Peter Zijlstra
  2014-05-16 13:51           ` [PATCH 0/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb() Oleg Nesterov
  2014-05-14 19:29         ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Oleg Nesterov
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-14 16:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 14, 2014 at 06:11:52PM +0200, Oleg Nesterov wrote:
> The subsequent discussion was "off-topic", and it seems that the patch
> itself needs a bit more discussion,
> 
> On 05/13, Peter Zijlstra wrote:
> >
> > On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> > > On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > > >  void unlock_page(struct page *page)
> > > >  {
> > > > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > > > +
> > > >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > +
> > > > +	/*
> > > > +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> > > > +	 * before waking waiters
> > > > +	 */
> > > >  	clear_bit_unlock(PG_locked, &page->flags);
> > > > -	smp_mb__after_clear_bit();
> > > > -	wake_up_page(page, PG_locked);
> > >
> > > This is wrong.
> 
> Yes,
> 
> > > The smp_mb__after_clear_bit() is still required to ensure
> > > that the cleared bit is visible before the wakeup on all architectures.
> 
> But note that "the cleared bit is visible before the wakeup" is confusing.
> I mean, we do not need mb() before __wake_up(). We need it only because
> __wake_up_bit() checks waitqueue_active().
> 
> 
> And at least
> 
> 	fs/cachefiles/namei.c:cachefiles_delete_object()
> 	fs/block_dev.c:blkdev_get()
> 	kernel/signal.c:task_clear_jobctl_trapping()
> 	security/keys/gc.c:key_garbage_collector()
> 
> look obviously wrong.
> 
> I would be happy to send the fix, but do I need to split it per-file?
> Given that it is trivial, perhaps I can send a single patch?

Since its all the same issue a single patch would be fine I think.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-14 16:11       ` Oleg Nesterov
  2014-05-14 16:17         ` Peter Zijlstra
@ 2014-05-14 19:29         ` Oleg Nesterov
  2014-05-14 20:53           ` Mel Gorman
  2014-05-15 10:48           ` [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4 Mel Gorman
  1 sibling, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-14 19:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On 05/14, Oleg Nesterov wrote:
>
> The subsequent discussion was "off-topic", and it seems that the patch
> itself needs a bit more discussion,
>
> On 05/13, Peter Zijlstra wrote:
> >
> > On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> > > On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > > >  void unlock_page(struct page *page)
> > > >  {
> > > > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > > > +
> > > >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > +
> > > > +	/*
> > > > +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> > > > +	 * before waking waiters
> > > > +	 */
> > > >  	clear_bit_unlock(PG_locked, &page->flags);
> > > > -	smp_mb__after_clear_bit();
> > > > -	wake_up_page(page, PG_locked);
> > >
> > > This is wrong.
>
> Yes,
>
> > > The smp_mb__after_clear_bit() is still required to ensure
> > > that the cleared bit is visible before the wakeup on all architectures.
>
> But note that "the cleared bit is visible before the wakeup" is confusing.
> I mean, we do not need mb() before __wake_up(). We need it only because
> __wake_up_bit() checks waitqueue_active().

OOPS. Sorry Mel, I wrote this looking at the chunk above.  But when I found
the whole patch http://marc.info/?l=linux-mm&m=139997442008267 I see that
it removes waitqueue_active(), so this can be correct. I do not really know,
so far I can't say I fully understand this PageWaiters() trick.

Hmm. But at least prepare_to_wait_exclusive() doesn't look right ;)

If nothing else, this needs abort_exclusive_wait() if killed. And while
"exclusive" is probably fine for __lock_page.*(), I am not sure that
__wait_on_page_locked_*() should be exclusive.

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 05/19] mm: page_alloc: Calculate classzone_idx once from the zonelist ref
  2014-05-13 22:25   ` Andrew Morton
  2014-05-14  6:32     ` Mel Gorman
@ 2014-05-14 20:29     ` Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-14 20:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Tue, May 13, 2014 at 03:25:56PM -0700, Andrew Morton wrote:
> On Tue, 13 May 2014 10:45:36 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > There is no need to calculate zone_idx(preferred_zone) multiple times
> > or use the pgdat to figure it out.
> > 
> 
> This one falls afoul of pending mm/next changes in non-trivial ways.

This should apply on top of what you already have. Thanks.

---8<---
mm: page_alloc: Calculate classzone_idx once from the zonelist ref

There is no need to calculate zone_idx(preferred_zone) multiple times
or use the pgdat to figure it out.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/page_alloc.c | 60 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 25 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7ce44f9..606eecf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1948,11 +1948,10 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
-		struct zone *preferred_zone, int migratetype)
+		struct zone *preferred_zone, int classzone_idx, int migratetype)
 {
 	struct zoneref *z;
 	struct page *page = NULL;
-	int classzone_idx;
 	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
@@ -1960,7 +1959,6 @@ get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
 				(gfp_mask & __GFP_WRITE);
 
-	classzone_idx = zone_idx(preferred_zone);
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
@@ -2218,7 +2216,7 @@ static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
@@ -2236,7 +2234,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
 		order, zonelist, high_zoneidx,
 		ALLOC_WMARK_HIGH|ALLOC_CPUSET,
-		preferred_zone, migratetype);
+		preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto out;
 
@@ -2271,7 +2269,7 @@ static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, enum migrate_mode mode,
+	int classzone_idx, int migratetype, enum migrate_mode mode,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
@@ -2299,7 +2297,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		page = get_page_from_freelist(gfp_mask, nodemask,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			preferred_zone->compact_blockskip_flush = false;
 			compaction_defer_reset(preferred_zone, order, true);
@@ -2331,7 +2329,8 @@ static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, enum migrate_mode mode, bool *contended_compaction,
+	int classzone_idx, int migratetype,
+	enum migrate_mode mode, bool *contended_compaction,
 	bool *deferred_compaction, unsigned long *did_some_progress)
 {
 	return NULL;
@@ -2387,7 +2386,7 @@ static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
-	int migratetype, unsigned long *did_some_progress)
+	int classzone_idx, int migratetype, unsigned long *did_some_progress)
 {
 	struct page *page = NULL;
 	bool drained = false;
@@ -2405,7 +2404,8 @@ retry:
 	page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx,
 					alloc_flags & ~ALLOC_NO_WATERMARKS,
-					preferred_zone, migratetype);
+					preferred_zone, classzone_idx,
+					migratetype);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -2430,14 +2430,14 @@ static inline struct page *
 __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	struct page *page;
 
 	do {
 		page = get_page_from_freelist(gfp_mask, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 
 		if (!page && gfp_mask & __GFP_NOFAIL)
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
@@ -2538,7 +2538,7 @@ static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int migratetype)
+	int classzone_idx, int migratetype)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	struct page *page = NULL;
@@ -2587,15 +2587,19 @@ restart:
 	 * Find the true preferred zone if the allocation is unconstrained by
 	 * cpusets.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
-		first_zones_zonelist(zonelist, high_zoneidx, NULL,
-					&preferred_zone);
+	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask) {
+		struct zoneref *preferred_zoneref;
+		preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
+				nodemask ? : &cpuset_current_mems_allowed,
+				&preferred_zone);
+		classzone_idx = zonelist_zone_idx(preferred_zoneref);
+	}
 
 rebalance:
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (page)
 		goto got_pg;
 
@@ -2610,7 +2614,7 @@ rebalance:
 
 		page = __alloc_pages_high_priority(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 		if (page) {
 			goto got_pg;
 		}
@@ -2641,7 +2645,8 @@ rebalance:
 	 */
 	page = __alloc_pages_direct_compact(gfp_mask, order, zonelist,
 					high_zoneidx, nodemask, alloc_flags,
-					preferred_zone, migratetype,
+					preferred_zone,
+					classzone_idx, migratetype,
 					migration_mode, &contended_compaction,
 					&deferred_compaction,
 					&did_some_progress);
@@ -2671,7 +2676,8 @@ rebalance:
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress);
+					classzone_idx, migratetype,
+					&did_some_progress);
 	if (page)
 		goto got_pg;
 
@@ -2690,7 +2696,7 @@ rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					migratetype);
+					classzone_idx, migratetype);
 			if (page)
 				goto got_pg;
 
@@ -2731,7 +2737,8 @@ rebalance:
 		 */
 		page = __alloc_pages_direct_compact(gfp_mask, order, zonelist,
 					high_zoneidx, nodemask, alloc_flags,
-					preferred_zone, migratetype,
+					preferred_zone,
+					classzone_idx, migratetype,
 					migration_mode, &contended_compaction,
 					&deferred_compaction,
 					&did_some_progress);
@@ -2760,10 +2767,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone *preferred_zone;
+	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
+	int classzone_idx;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2786,11 +2795,12 @@ retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
 	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx,
+	preferred_zoneref = first_zones_zonelist(zonelist, high_zoneidx,
 				nodemask ? : &cpuset_current_mems_allowed,
 				&preferred_zone);
 	if (!preferred_zone)
 		goto out;
+	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
@@ -2800,7 +2810,7 @@ retry:
 	/* First allocation attempt */
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, alloc_flags,
-			preferred_zone, migratetype);
+			preferred_zone, classzone_idx, migratetype);
 	if (unlikely(!page)) {
 		/*
 		 * The first pass makes sure allocations are spread
@@ -2826,7 +2836,7 @@ retry:
 		gfp_mask = memalloc_noio_flags(gfp_mask);
 		page = __alloc_pages_slowpath(gfp_mask, order,
 				zonelist, high_zoneidx, nodemask,
-				preferred_zone, migratetype);
+				preferred_zone, classzone_idx, migratetype);
 	}
 
 	trace_mm_page_alloc(page, order, gfp_mask, migratetype);

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath
  2014-05-14 19:29         ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Oleg Nesterov
@ 2014-05-14 20:53           ` Mel Gorman
  2014-05-15 10:48           ` [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4 Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-14 20:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 14, 2014 at 09:29:45PM +0200, Oleg Nesterov wrote:
> On 05/14, Oleg Nesterov wrote:
> >
> > The subsequent discussion was "off-topic", and it seems that the patch
> > itself needs a bit more discussion,
> >
> > On 05/13, Peter Zijlstra wrote:
> > >
> > > On Tue, May 13, 2014 at 01:53:13PM +0100, Mel Gorman wrote:
> > > > On Tue, May 13, 2014 at 10:45:50AM +0100, Mel Gorman wrote:
> > > > >  void unlock_page(struct page *page)
> > > > >  {
> > > > > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > > > > +
> > > > >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > > > > +
> > > > > +	/*
> > > > > +	 * No additional barrier needed due to clear_bit_unlock barriering all updates
> > > > > +	 * before waking waiters
> > > > > +	 */
> > > > >  	clear_bit_unlock(PG_locked, &page->flags);
> > > > > -	smp_mb__after_clear_bit();
> > > > > -	wake_up_page(page, PG_locked);
> > > >
> > > > This is wrong.
> >
> > Yes,
> >
> > > > The smp_mb__after_clear_bit() is still required to ensure
> > > > that the cleared bit is visible before the wakeup on all architectures.
> >
> > But note that "the cleared bit is visible before the wakeup" is confusing.
> > I mean, we do not need mb() before __wake_up(). We need it only because
> > __wake_up_bit() checks waitqueue_active().
> 
> OOPS. Sorry Mel, I wrote this looking at the chunk above.  But when I found
> the whole patch http://marc.info/?l=linux-mm&m=139997442008267 I see that
> it removes waitqueue_active(), so this can be correct. I do not really know,
> so far I can't say I fully understand this PageWaiters() trick.
> 

The intent is to use a page bit to determine if looking up the waitqueue is
worthwhile. However, it is currently race-prone and while barriers can be
used to reduce the race, I did not see how it could be eliminated without
using a lock which would defeat the purpose.
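
In other words, the fast path the patch aims for is roughly this
(simplified sketch; the posted patch also clears PG_waiters and deals
with the races described in its comments):

	void unlock_page(struct page *page)
	{
		VM_BUG_ON_PAGE(!PageLocked(page), page);
		clear_bit_unlock(PG_locked, &page->flags);
		smp_mb__after_atomic();

		/*
		 * Only pay for the waitqueue hash lookup and the wakeup when
		 * a sleeper has announced itself via PG_waiters.
		 */
		if (PageWaiters(page))
			__wake_up_bit(page_waitqueue(page), &page->flags,
				      PG_locked);
	}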

> Hmm. But at least prepare_to_wait_exclusive() doesn't look right ;)
> 
> If nothing else, this needs abort_exclusive_wait() if killed.

Yes, I'll fix that.

> And while
> "exclusive" is probably fine for __lock_page.*(), I am not sure that
> __wait_on_page_locked_*() should be exclusive.
> 

Indeed it shouldn't. Exclusive waits should only be used when the lock is
being acquired. Thanks for pointing that out.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4
  2014-05-14 19:29         ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Oleg Nesterov
  2014-05-14 20:53           ` Mel Gorman
@ 2014-05-15 10:48           ` Mel Gorman
  2014-05-15 13:20             ` Peter Zijlstra
                               ` (2 more replies)
  1 sibling, 3 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-15 10:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

Changelog since v3
o Correct handling of exclusive waits

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are processes waiting on PG_locked and uses it to
avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but the
primary motivation for this patch is large machines so I do not think that
is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of
the file is 1/10th physical memory to avoid dirty page balancing. In the
async case it will be possible that the workload completes without even
hitting the disk and will have variable results but highlight the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO; only
wall times are shown for async as the granularity reported by dd and the
variability make the throughput figures unsuitable for comparison. As async
results were variable due to writeback timings, I'm only reporting the
maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The
kernels being compared are "accessed-v2", which is the patch series up
to this patch, whereas lockpage-v2 includes this patch.

async dd
                                   3.15.0-rc3            3.15.0-rc3
                                  accessed-v3           lockpage-v3
ext3   Max      elapsed     11.5900 (  0.00%)     11.0000 (  5.09%)
ext4   Max      elapsed     13.3400 (  0.00%)     13.4300 ( -0.67%)
tmpfs  Max      elapsed      0.4900 (  0.00%)      0.4800 (  2.04%)
btrfs  Max      elapsed     12.7800 (  0.00%)     13.8200 ( -8.14%)
xfs    Max      elapsed      2.0900 (  0.00%)      2.1100 ( -0.96%)

The xfs gain is the hardest to explain: it consistently manages to miss the
worst cases. In the other cases, the results are variable due to the async
nature of the test but the min and max figures are consistently better.

     samples percentage
ext3   90049     1.0238  vmlinux-3.15.0-rc4-accessed-v3 __wake_up_bit
ext3   61716     0.7017  vmlinux-3.15.0-rc4-accessed-v3 page_waitqueue
ext3   47529     0.5404  vmlinux-3.15.0-rc4-accessed-v3 unlock_page
ext3   23833     0.2710  vmlinux-3.15.0-rc4-accessed-v3 mark_page_accessed
ext3    9543     0.1085  vmlinux-3.15.0-rc4-accessed-v3 wake_up_bit
ext3    5036     0.0573  vmlinux-3.15.0-rc4-accessed-v3 init_page_accessed
ext3     369     0.0042  vmlinux-3.15.0-rc4-accessed-v3 __lock_page
ext3       1    1.1e-05  vmlinux-3.15.0-rc4-accessed-v3 lock_page
ext3   37376     0.4233  vmlinux-3.15.0-rc4-waitqueue-v3 unlock_page
ext3   11856     0.1343  vmlinux-3.15.0-rc4-waitqueue-v3 __wake_up_bit
ext3   11096     0.1257  vmlinux-3.15.0-rc4-waitqueue-v3 wake_up_bit
ext3     107     0.0012  vmlinux-3.15.0-rc4-waitqueue-v3 page_waitqueue
ext3      34    3.9e-04  vmlinux-3.15.0-rc4-waitqueue-v3 __lock_page
ext3       4    4.5e-05  vmlinux-3.15.0-rc4-waitqueue-v3 lock_page

There is a similar story told for each of the filesystems -- much less
time spent in page_waitqueue and __wake_up_bit due to the fact that they
now rarely need to be called. Note that for workloads that contend heavily
on the page lock, unlock_page will *increase* in cost as it has to
clear PG_waiters so while the typical case should be much faster, the worst
case costs are now higher.

The Intel vm-scalability tests tell a similar story. The ones measured here
are broadly based on dd of files 10 times the size of memory with one dd per
CPU in the system

                                              3.15.0-rc3            3.15.0-rc3
                                             accessed-v3           lockpage-v3
ext3  lru-file-readonce    elapsed      3.6300 (  0.00%)      3.6300 (  0.00%)
ext3 lru-file-readtwice    elapsed      6.0800 (  0.00%)      6.0700 (  0.16%)
ext4  lru-file-readonce    elapsed      3.7300 (  0.00%)      3.5400 (  5.09%)
ext4 lru-file-readtwice    elapsed      6.2400 (  0.00%)      6.0100 (  3.69%)
btrfs lru-file-readonce    elapsed      5.0100 (  0.00%)      4.9300 (  1.60%)
btrfslru-file-readtwice    elapsed      7.5800 (  0.00%)      7.6300 ( -0.66%)
xfs   lru-file-readonce    elapsed      3.7000 (  0.00%)      3.6400 (  1.62%)
xfs  lru-file-readtwice    elapsed      6.2400 (  0.00%)      5.8600 (  6.09%)

In most cases the time to read the file is slightly lowered. Unlike the
previous test there is no impact on mark_page_accessed as the pages are
already resident for this test and there is no opportunity to mark the
pages accessed without using atomic operations. Instead the profiles show
a reduction in the time spent in page_waitqueue.

This is similarly reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                       3.15.0-rc3            3.15.0-rc3
                      accessed-v2           lockpage-v2
Procs 107M     567.0000 (  0.00%)    542.0000 (  4.41%)
Procs 214M    1075.0000 (  0.00%)   1041.0000 (  3.16%)
Procs 322M    1918.0000 (  0.00%)   1522.0000 ( 20.65%)
Procs 429M    2063.0000 (  0.00%)   1950.0000 (  5.48%)
Procs 536M    2566.0000 (  0.00%)   2506.0000 (  2.34%)
Procs 644M    2920.0000 (  0.00%)   2804.0000 (  3.97%)
Procs 751M    3366.0000 (  0.00%)   3260.0000 (  3.15%)
Procs 859M    3800.0000 (  0.00%)   3672.0000 (  3.37%)
Procs 966M    4291.0000 (  0.00%)   4236.0000 (  1.28%)
Procs 1073M   4923.0000 (  0.00%)   4815.0000 (  2.19%)
Procs 1181M   5223.0000 (  0.00%)   5075.0000 (  2.83%)
Procs 1288M   5576.0000 (  0.00%)   5419.0000 (  2.82%)
Procs 1395M   5855.0000 (  0.00%)   5636.0000 (  3.74%)
Procs 1503M   6049.0000 (  0.00%)   5862.0000 (  3.09%)
Procs 1610M   6454.0000 (  0.00%)   6137.0000 (  4.91%)
Procs 1717M   6806.0000 (  0.00%)   6474.0000 (  4.88%)
Procs 1825M   7377.0000 (  0.00%)   6979.0000 (  5.40%)
Procs 1932M   7633.0000 (  0.00%)   7396.0000 (  3.10%)
Procs 2040M   8137.0000 (  0.00%)   7769.0000 (  4.52%)
Procs 2147M   8617.0000 (  0.00%)   8205.0000 (  4.78%)

         samples percentage
xfs        67544     1.1655  vmlinux-3.15.0-rc4-accessed-v3 unlock_page
xfs        49888     0.8609  vmlinux-3.15.0-rc4-accessed-v3 __wake_up_bit
xfs         1747     0.0301  vmlinux-3.15.0-rc4-accessed-v3 block_page_mkwrite
xfs         1578     0.0272  vmlinux-3.15.0-rc4-accessed-v3 wake_up_bit
xfs            2    3.5e-05  vmlinux-3.15.0-rc4-accessed-v3 lock_page
xfs        83010     1.3447  vmlinux-3.15.0-rc4-waitqueue-v3 unlock_page
xfs         2354     0.0381  vmlinux-3.15.0-rc4-waitqueue-v3 __wake_up_bit
xfs         2064     0.0334  vmlinux-3.15.0-rc4-waitqueue-v3 wake_up_bit
xfs           26    4.2e-04  vmlinux-3.15.0-rc4-waitqueue-v3 page_waitqueue
xfs            3    4.9e-05  vmlinux-3.15.0-rc4-waitqueue-v3 lock_page
xfs            2    3.2e-05  vmlinux-3.15.0-rc4-waitqueue-v3 __lock_page

[jack@suse.cz: Fix add_page_wait_queue]
[mhocko@suse.cz: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@sgi.com: Do not update struct page unnecessarily]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h |  18 +++++
 include/linux/pagemap.h    |   6 +-
 mm/filemap.c               | 172 ++++++++++++++++++++++++++++++++++++++++-----
 mm/page_alloc.c            |   1 +
 mm/swap.c                  |  10 +++
 mm/vmscan.c                |   3 +
 6 files changed, 190 insertions(+), 20 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7baf0fe..b697e4f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+	TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -509,6 +526,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index c74f8bb..2124a83 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -513,13 +513,15 @@ static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
  * Never use this directly!
  */
 extern void wait_on_page_bit(struct page *page, int bit_nr);
+extern void __wait_on_page_locked(struct page *page);
 
 extern int wait_on_page_bit_killable(struct page *page, int bit_nr);
+extern int __wait_on_page_locked_killable(struct page *page);
 
 static inline int wait_on_page_locked_killable(struct page *page)
 {
 	if (PageLocked(page))
-		return wait_on_page_bit_killable(page, PG_locked);
+		return __wait_on_page_locked_killable(page);
 	return 0;
 }
 
@@ -533,7 +535,7 @@ static inline int wait_on_page_locked_killable(struct page *page)
 static inline void wait_on_page_locked(struct page *page)
 {
 	if (PageLocked(page))
-		wait_on_page_bit(page, PG_locked);
+		__wait_on_page_locked(page);
 }
 
 /* 
diff --git a/mm/filemap.c b/mm/filemap.c
index bec4b9b..5034ca7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -241,15 +241,22 @@ void delete_from_page_cache(struct page *page)
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
-static int sleep_on_page(void *word)
+static int sleep_on_page(struct page *page)
 {
-	io_schedule();
+	/*
+	 * A racing unlock can miss that the waitqueue is active and clear the
+	 * waiters again. Only sleep if PageWaiters is still set and timeout
+	 * to recheck as races can still occur.
+	 */
+	if (PageWaiters(page))
+		io_schedule_timeout(HZ);
+
 	return 0;
 }
 
-static int sleep_on_page_killable(void *word)
+static int sleep_on_page_killable(struct page *page)
 {
-	sleep_on_page(word);
+	sleep_on_page(page);
 	return fatal_signal_pending(current) ? -EINTR : 0;
 }
 
@@ -682,30 +689,87 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-static inline void wake_up_page(struct page *page, int bit)
+static inline wait_queue_head_t *clear_page_waiters(struct page *page)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	wait_queue_head_t *wqh = NULL;
+
+	if (!PageWaiters(page))
+		return NULL;
+
+	/*
+	 * Prepare to clear PG_waiters if the waitqueue is no longer
+	 * active. Note that there is no guarantee that a page with no
+	 * waiters will get cleared as there may be unrelated pages
+	 * sleeping on the same page wait queue. Accurate detection
+	 * would require a counter. In the event of a collision, the
+	 * waiter bit will dangle and lookups will be required until
+	 * the page is unlocked without collisions. The bit will need to
+	 * be cleared before freeing to avoid triggering debug checks.
+	 *
+	 * Furthermore, this can race with processes about to sleep on
+	 * the same page if it adds itself to the waitqueue just after
+	 * this check. The timeout in sleep_on_page prevents the race
+	 * being a terminal one. In effect, the uncontended and non-race
+	 * cases are faster in exchange for occasional worst case of the
+	 * timeout saving us.
+	 */
+	wqh = page_waitqueue(page);
+	if (!waitqueue_active(wqh))
+		ClearPageWaiters(page);
+
+	return wqh;
+}
+
+/* Returns true if the page is locked */
+static inline bool prepare_wait_bit(struct page *page, wait_queue_head_t *wqh,
+			wait_queue_t *wq, int state, int bit_nr, bool exclusive)
+{
+
+	/* Set PG_waiters so a racing unlock_page will check the waitqueue */
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
+
+	if (exclusive)
+		prepare_to_wait_exclusive(wqh, wq, state);
+	else
+		prepare_to_wait(wqh, wq, state);
+	return test_bit(bit_nr, &page->flags);
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
+	wait_queue_head_t *wqh;
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
-	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	if (!test_bit(bit_nr, &page->flags))
+		return;
+	wqh = page_waitqueue(page);
+
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
+			sleep_on_page_killable(page);
+	} while (test_bit(bit_nr, &page->flags));
+	finish_wait(wqh, &wait.wait);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
 int wait_on_page_bit_killable(struct page *page, int bit_nr)
 {
+	wait_queue_head_t *wqh;
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
+	int ret = 0;
 
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
+	wqh = page_waitqueue(page);
+
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
+			ret = sleep_on_page_killable(page);
+	} while (!ret && test_bit(bit_nr, &page->flags));
+	finish_wait(wqh, &wait.wait);
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
-			     sleep_on_page_killable, TASK_KILLABLE);
+	return ret;
 }
 
 /**
@@ -721,6 +785,8 @@ void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
 	unsigned long flags;
 
 	spin_lock_irqsave(&q->lock, flags);
+	if (!PageWaiters(page))
+		SetPageWaiters(page);
 	__add_wait_queue(q, waiter);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
@@ -740,10 +806,29 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
  */
 void unlock_page(struct page *page)
 {
+	wait_queue_head_t *wqh = clear_page_waiters(page);
+
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
-	clear_bit_unlock(PG_locked, &page->flags);
+
+	/*
+	 * clear_bit_unlock is not necessary in this case as there is no
+	 * need to strongly order the clearing of PG_waiters and PG_locked.
+	 * The smp_mb__after_atomic() barrier is still required for RELEASE
+	 * semantics as there is no guarantee that a wakeup will take place
+	 */
+	clear_bit(PG_locked, &page->flags);
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+
+	/*
+	 * Wake the queue if waiters were detected. Ordinarily this wakeup
+	 * would be unconditional to catch races between the lock bit being
+	 * set and a new process joining the queue. However, that would
+	 * require the waitqueue to be looked up every time. Instead we
+	 * optimise for the uncontended and non-race case and recover using
+	 * a timeout in sleep_on_page.
+	 */
+	if (wqh)
+		__wake_up_bit(wqh, &page->flags, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -753,14 +838,18 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
+	wait_queue_head_t *wqh;
 	if (TestClearPageReclaim(page))
 		rotate_reclaimable_page(page);
 
 	if (!test_clear_page_writeback(page))
 		BUG();
 
+	wqh = clear_page_waiters(page);
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
+
+	if (wqh)
+		__wake_up_bit(wqh, &page->flags, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -795,22 +884,69 @@ EXPORT_SYMBOL_GPL(page_endio);
  */
 void __lock_page(struct page *page)
 {
+	wait_queue_head_t *wqh = page_waitqueue(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_UNINTERRUPTIBLE, PG_locked, true))
+			sleep_on_page(page);
+	} while (!trylock_page(page));
+
+	finish_wait(wqh, &wait.wait);
 }
 EXPORT_SYMBOL(__lock_page);
 
 int __lock_page_killable(struct page *page)
 {
+	wait_queue_head_t *wqh = page_waitqueue(page);
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+	int ret = 0;
+
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, PG_locked, true))
+			ret = sleep_on_page_killable(page);
+	} while (!ret && !trylock_page(page));
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	if (!ret)
+		finish_wait(wqh, &wait.wait);
+	else
+		abort_exclusive_wait(wqh, &wait.wait, TASK_KILLABLE, &wait.key);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
+int  __wait_on_page_locked_killable(struct page *page)
+{
+	int ret = 0;
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, PG_locked, false))
+			ret = sleep_on_page_killable(page);
+	} while (!ret && PageLocked(page));
+
+	finish_wait(wqh, &wait.wait);
+
+	return ret;
+}
+EXPORT_SYMBOL(__wait_on_page_locked_killable);
+
+void  __wait_on_page_locked(struct page *page)
+{
+	wait_queue_head_t *wqh = page_waitqueue(page);
+	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
+
+	do {
+		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_UNINTERRUPTIBLE, PG_locked, false))
+			sleep_on_page(page);
+	} while (PageLocked(page));
+
+	finish_wait(wqh, &wait.wait);
+}
+EXPORT_SYMBOL(__wait_on_page_locked);
+
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 			 unsigned int flags)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 606eecf..0959b09 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6604,6 +6604,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..bf9bd4c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+
+	/* Clear dangling waiters from collisions on page_waitqueue */
+	__ClearPageWaiters(page);
+
 	free_hot_cold_page(page, false);
 }
 
@@ -916,6 +920,12 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
 
+		/*
+		 * Clear waiters bit that may still be set due to a collision
+		 * on page_waitqueue
+		 */
+		__ClearPageWaiters(page);
+
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f85041..e409cbc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,6 +1096,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * waiting on the page lock, because there are no references.
 		 */
 		__clear_page_locked(page);
+		__ClearPageWaiters(page);
 free_it:
 		nr_reclaimed++;
 
@@ -1427,6 +1428,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
@@ -1650,6 +1652,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4
  2014-05-15 10:48           ` [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4 Mel Gorman
@ 2014-05-15 13:20             ` Peter Zijlstra
  2014-05-15 13:29               ` Peter Zijlstra
                                 ` (2 more replies)
  2014-05-15 15:03             ` Oleg Nesterov
  2014-05-15 21:24             ` Andrew Morton
  2 siblings, 3 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-15 13:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Thu, May 15, 2014 at 11:48:09AM +0100, Mel Gorman wrote:

> +static inline wait_queue_head_t *clear_page_waiters(struct page *page)
>  {
> +	wait_queue_head_t *wqh = NULL;
> +
> +	if (!PageWaiters(page))
> +		return NULL;
> +
> +	/*
> +	 * Prepare to clear PG_waiters if the waitqueue is no longer
> +	 * active. Note that there is no guarantee that a page with no
> +	 * waiters will get cleared as there may be unrelated pages
> +	 * sleeping on the same page wait queue. Accurate detection
> +	 * would require a counter. In the event of a collision, the
> +	 * waiter bit will dangle and lookups will be required until
> +	 * the page is unlocked without collisions. The bit will need to
> +	 * be cleared before freeing to avoid triggering debug checks.
> +	 *
> +	 * Furthermore, this can race with processes about to sleep on
> +	 * the same page if it adds itself to the waitqueue just after
> +	 * this check. The timeout in sleep_on_page prevents the race
> +	 * being a terminal one. In effect, the uncontended and non-race
> +	 * cases are faster in exchange for occasional worst case of the
> +	 * timeout saving us.
> +	 */
> +	wqh = page_waitqueue(page);
> +	if (!waitqueue_active(wqh))
> +		ClearPageWaiters(page);
> +
> +	return wqh;
> +}

So clear_page_waiters() is I think a bad name for this function, for one
it doesn't relate to returning a wait_queue_head.

Secondly, I think the clear condition is wrong, if I understand the rest
of the code correctly we'll keep PageWaiters set until the above
condition, which is not a single waiter on the waitqueue.

Would it not make much more sense to clear the page when there are no
more waiters of this page?

For the case where there are no waiters at all, this is the same
condition, but in case there's a hash collision and there's other pages
waiting, we'll iterate the lot anyway, so we might as well clear it
there.

> +/* Returns true if the page is locked */
> +static inline bool prepare_wait_bit(struct page *page, wait_queue_head_t *wqh,
> +			wait_queue_t *wq, int state, int bit_nr, bool exclusive)
> +{
> +
> +	/* Set PG_waiters so a racing unlock_page will check the waitqueue */
> +	if (!PageWaiters(page))
> +		SetPageWaiters(page);
> +
> +	if (exclusive)
> +		prepare_to_wait_exclusive(wqh, wq, state);
> +	else
> +		prepare_to_wait(wqh, wq, state);
> +	return test_bit(bit_nr, &page->flags);
>  }
>  
>  void wait_on_page_bit(struct page *page, int bit_nr)
>  {
> +	wait_queue_head_t *wqh;
>  	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
>  
> +	if (!test_bit(bit_nr, &page->flags))
> +		return;
> +	wqh = page_waitqueue(page);
> +
> +	do {
> +		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
> +			sleep_on_page_killable(page);
> +	} while (test_bit(bit_nr, &page->flags));
> +	finish_wait(wqh, &wait.wait);
>  }
>  EXPORT_SYMBOL(wait_on_page_bit);

Afaict, after this patch, wait_on_page_bit() is only used by
wait_on_page_writeback(), and might I ask why that needs the PageWaiter
set?

>  int wait_on_page_bit_killable(struct page *page, int bit_nr)
>  {
> +	wait_queue_head_t *wqh;
>  	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> +	int ret = 0;
>  
>  	if (!test_bit(bit_nr, &page->flags))
>  		return 0;
> +	wqh = page_waitqueue(page);
> +
> +	do {
> +		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
> +			ret = sleep_on_page_killable(page);
> +	} while (!ret && test_bit(bit_nr, &page->flags));
> +	finish_wait(wqh, &wait.wait);
>  
> +	return ret;
>  }

The only user of wait_on_page_bit_killable() _was_
wait_on_page_locked_killable(), but you've just converted that to use
__wait_on_page_bit_killable().

So we can scrap this function.

>  /**
> @@ -721,6 +785,8 @@ void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&q->lock, flags);
> +	if (!PageWaiters(page))
> +		SetPageWaiters(page);
>  	__add_wait_queue(q, waiter);
>  	spin_unlock_irqrestore(&q->lock, flags);
>  }

What does add_page_wait_queue() do and why does it need PageWaiters?

> @@ -740,10 +806,29 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
>   */
>  void unlock_page(struct page *page)
>  {
> +	wait_queue_head_t *wqh = clear_page_waiters(page);
> +
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +
> +	/*
> +	 * clear_bit_unlock is not necessary in this case as there is no
> +	 * need to strongly order the clearing of PG_waiters and PG_locked.
> +	 * The smp_mb__after_atomic() barrier is still required for RELEASE
> +	 * semantics as there is no guarantee that a wakeup will take place
> +	 */
> +	clear_bit(PG_locked, &page->flags);
>  	smp_mb__after_atomic();

If you need RELEASE, use _unlock() because that's exactly what it does.

> +
> +	/*
> +	 * Wake the queue if waiters were detected. Ordinarily this wakeup
> +	 * would be unconditional to catch races between the lock bit being
> +	 * set and a new process joining the queue. However, that would
> +	 * require the waitqueue to be looked up every time. Instead we
> +	 * optimise for the uncontended and non-race case and recover using
> +	 * a timeout in sleep_on_page.
> +	 */
> +	if (wqh)
> +		__wake_up_bit(wqh, &page->flags, PG_locked);

And the only reason we're not clearing PageWaiters under q->lock is to
skimp on the last contended unlock_page() ?

>  }
>  EXPORT_SYMBOL(unlock_page);
>  
> @@ -795,22 +884,69 @@ EXPORT_SYMBOL_GPL(page_endio);
>   */
>  void __lock_page(struct page *page)
>  {
> +	wait_queue_head_t *wqh = page_waitqueue(page);
>  	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>  
> +	do {
> +		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_UNINTERRUPTIBLE, PG_locked, true))
> +			sleep_on_page(page);
> +	} while (!trylock_page(page));
> +
> +	finish_wait(wqh, &wait.wait);
>  }



So I suppose I'm failing to see the problem with something like:

extern void __lock_page(struct page *);
extern void __unlock_page(struct page *);

static inline void lock_page(struct page *page)
{
	if (!trylock_page(page))
		__lock_page(page);
}

static inline void unlock_page(struct page *page)
{
	clear_bit_unlock(PG_locked, &page->flags);
	if (PageWaiters(page))
		__unlock_page(page);
}

void __lock_page(struct page *page)
{
	wait_queue_head_t *wqh = page_waitqueue(page);
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

	spin_lock_irq(&wqh->lock);
	if (!PageWaiters(page))
		SetPageWaiters(page);

	wait.wait.flags |= WQ_FLAG_EXCLUSIVE;
	preempt_disable();
	do {
		if (list_empty(&wait.wait.task_list))
			__add_wait_queue_tail(wqh, &wait.wait);

		set_current_state(TASK_UNINTERRUPTIBLE);

		if (test_bit(wait.key.bit_nr, wait.key.flags)) {
			spin_unlock_irq(&wqh->lock);
			schedule_preempt_disabled();
			spin_lock_irq(&wqh->lock);
		}
	} while (!trylock_page(page));

	__remove_wait_queue(wqh, &wait.wait);
	__set_current_state(TASK_RUNNING);
	preempt_enable();
	spin_unlock_irq(&wqh->lock);
}

void __unlock_page(struct page *page)
{
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
	wait_queue_head_t *wqh = page_waitqueue(page);
	wait_queue_t *curr;

	spin_lock_irq(&wqh->lock);
	list_for_each_entry(curr, &wqh->task_list, task_list) {
		unsigned int flags = curr->flags;

		if (curr->func(curr, TASK_NORMAL, 0, &key))
			goto unlock;
	}
	ClearPageWaiters(page);
unlock:
	spin_unlock_irq(&wqh->lock);
}

Yes, the __unlock_page() will have the unconditional wqh->lock, but it
should also call __unlock_page() a lot less, and it doesn't have that
horrid timeout.

Now, the above is clearly sub-optimal when !extended_page_flags, but I
suppose we could have two versions of __unlock_page() for that.
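
For illustration, the !CONFIG_PAGEFLAGS_EXTENDED stubs could simply report
waiters unconditionally (sketch only, just the obvious fallback):

static inline bool PageWaiters(struct page *page) { return true; }
static inline void SetPageWaiters(struct page *page) { }
static inline void ClearPageWaiters(struct page *page) { }

which keeps 32-bit permanently on the slow path.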

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4
  2014-05-15 13:20             ` Peter Zijlstra
@ 2014-05-15 13:29               ` Peter Zijlstra
  2014-05-15 15:34               ` Oleg Nesterov
  2014-05-15 16:18               ` Mel Gorman
  2 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-15 13:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

[-- Attachment #1: Type: text/plain, Size: 980 bytes --]

On Thu, May 15, 2014 at 03:20:58PM +0200, Peter Zijlstra wrote:
> void __unlock_page(struct page *page)
> {
> 	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, PG_locked);
> 	struct wait_queue_head_t *wqh = page_waitqueue(page);
> 	wait_queue_t *curr;

	if (!PageWaiters(page) && !waitqueue_active(wqh))
		return;

> 	spin_lock_irq(&wqh->lock);
> 	list_for_each_entry(curr, &wqh->task_list, task_list) {
> 		unsigned int flags = curr->flags;
> 
> 		if (curr->func(curr, TASK_NORMAL, 0, &key))
> 			goto unlock;
> 	}
> 	ClearPageWaiters(page);
> unlock:
> 	spin_unlock_irq(&wqh->lock);
> }
> 
> Yes, the __unlock_page() will have the unconditional wqh->lock, but it
> should also call __unlock_page() a lot less, and it doesn't have that
> horrid timeout.
> 
> Now, the above is clearly sub-optimal when !extended_page_flags, but I
> suppose we could have two versions of __unlock_page() for that.

Or I suppose the above would fix it too.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4
  2014-05-15 10:48           ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4 Mel Gorman
  2014-05-15 13:20             ` Peter Zijlstra
@ 2014-05-15 15:03             ` Oleg Nesterov
  2014-05-15 21:24             ` Andrew Morton
  2 siblings, 0 replies; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-15 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Peter Zijlstra, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On 05/15, Mel Gorman wrote:
>
> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

I can't apply this patch, it depends on something else, so I am not sure
I read it correctly. I'll try to read it later, just one question for now.

>  void unlock_page(struct page *page)
>  {
> +	wait_queue_head_t *wqh = clear_page_waiters(page);
> +
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> -	clear_bit_unlock(PG_locked, &page->flags);
> +
> +	/*
> +	 * clear_bit_unlock is not necessary in this case as there is no
> +	 * need to strongly order the clearing of PG_waiters and PG_locked.

OK,

> +	 * The smp_mb__after_atomic() barrier is still required for RELEASE
> +	 * semantics as there is no guarantee that a wakeup will take place
> +	 */
> +	clear_bit(PG_locked, &page->flags);
>  	smp_mb__after_atomic();

But clear_bit_unlock() provides the release semantics, so why is mb__after
better?

> -	wake_up_page(page, PG_locked);
> +
> +	/*
> +	 * Wake the queue if waiters were detected. Ordinarily this wakeup
> +	 * would be unconditional to catch races between the lock bit being
> +	 * set and a new process joining the queue. However, that would
> +	 * require the waitqueue to be looked up every time. Instead we
> +	 * optimise for the uncontended and non-race case and recover using
> +	 * a timeout in sleep_on_page.
> +	 */
> +	if (wqh)
> +		__wake_up_bit(wqh, &page->flags, PG_locked);

This is what I can't understand. Given that the PageWaiters() logic is racy
anyway (and timeout(HZ) should save us), why do we need to call
clear_page_waiters() beforehand? Why can't unlock_page/end_page_writeback
simply call wake_up_page_bit(), which checks/clears PG_waiters at the end?
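
Something like this, purely as an illustration (wake_up_page_bit() is a
hypothetical helper here, not even compile-tested):

static void wake_up_page_bit(struct page *page, int bit_nr)
{
	wait_queue_head_t *wqh = page_waitqueue(page);
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(&page->flags, bit_nr);

	if (!PageWaiters(page))
		return;

	__wake_up(wqh, TASK_NORMAL, 1, &key);

	/* clear the hint once nothing sleeps on this (hashed) queue */
	if (!waitqueue_active(wqh))
		ClearPageWaiters(page);
}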

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4
  2014-05-15 13:20             ` Peter Zijlstra
  2014-05-15 13:29               ` Peter Zijlstra
@ 2014-05-15 15:34               ` Oleg Nesterov
  2014-05-15 15:45                 ` Peter Zijlstra
  2014-05-15 16:18               ` Mel Gorman
  2 siblings, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-15 15:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On 05/15, Peter Zijlstra wrote:
>
> So I suppose I'm failing to see the problem with something like:

Yeeees, I was thinking about something like this too ;)

> static inline void lock_page(struct page *page)
> {
> 	if (!trylock_page(page))
> 		__lock_page(page);
> }
>
> static inline void unlock_page(struct page *page)
> {
> 	clear_bit_unlock(&page->flags, PG_locked);
> 	if (PageWaiters(page))
> 		__unlock_page();
> }

but in this case we need mb() before PageWaiters(), I guess.

> void __lock_page(struct page *page)
> {
> 	struct wait_queue_head_t *wqh = page_waitqueue(page);
> 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> 	spin_lock_irq(&wqh->lock);
> 	if (!PageWaiters(page))
> 		SetPageWaiters(page);
>
> 	wait.flags |= WQ_FLAG_EXCLUSIVE;
> 	preempt_disable();

why?

> 	do {
> 		if (list_empty(&wait->task_list))
> 			__add_wait_queue_tail(wqh, &wait);
>
> 		set_current_state(TASK_UNINTERRUPTIBLE);
>
> 		if (test_bit(wait.key.bit_nr, wait.key.flags)) {
> 			spin_unlock_irq(&wqh->lock);
> 			schedule_preempt_disabled();
> 			spin_lock_irq(&wqh->lock);

OK, probably to avoid the preemption before schedule(). I still can't
understand why this makes sense, but in this case it would be better
to do the disable/enable under "if (test_bit())"?

Of course, this needs more work for lock_page_killable(), but this
should be simple.
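
Something along these lines perhaps, following the same structure (untested
sketch, error handling kept minimal):

int __lock_page_killable(struct page *page)
{
	wait_queue_head_t *wqh = page_waitqueue(page);
	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
	int err = 0;

	spin_lock_irq(&wqh->lock);
	if (!PageWaiters(page))
		SetPageWaiters(page);

	wait.wait.flags |= WQ_FLAG_EXCLUSIVE;
	do {
		if (list_empty(&wait.wait.task_list))
			__add_wait_queue_tail(wqh, &wait.wait);

		set_current_state(TASK_KILLABLE);

		if (fatal_signal_pending(current)) {
			err = -EINTR;
			break;
		}

		if (test_bit(wait.key.bit_nr, wait.key.flags)) {
			spin_unlock_irq(&wqh->lock);
			schedule();
			spin_lock_irq(&wqh->lock);
		}
	} while (!trylock_page(page));

	__remove_wait_queue(wqh, &wait.wait);
	__set_current_state(TASK_RUNNING);
	spin_unlock_irq(&wqh->lock);

	return err;
}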

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4
  2014-05-15 15:34               ` Oleg Nesterov
@ 2014-05-15 15:45                 ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-15 15:45 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

[-- Attachment #1: Type: text/plain, Size: 2079 bytes --]

On Thu, May 15, 2014 at 05:34:24PM +0200, Oleg Nesterov wrote:
> On 05/15, Peter Zijlstra wrote:
> >
> > So I suppose I'm failing to see the problem with something like:
> 
> Yeeees, I was thinking about something like this too ;)
> 
> > static inline void lock_page(struct page *page)
> > {
> > 	if (!trylock_page(page))
> > 		__lock_page(page);
> > }
> >
> > static inline void unlock_page(struct page *page)
> > {
> > 	clear_bit_unlock(&page->flags, PG_locked);
> > 	if (PageWaiters(page))
> > 		__unlock_page();
> > }
> 
> but in this case we need mb() before PageWaiters(), I guess.

Ah indeed so, or rather, this is a good reason to use smp_mb__after_atomic().
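
i.e. the fastpath ends up looking something like (sketch):

static inline void unlock_page(struct page *page)
{
	clear_bit_unlock(PG_locked, &page->flags);
	/* order the PG_locked clear against the PG_waiters load below */
	smp_mb__after_atomic();
	if (PageWaiters(page))
		__unlock_page(page);
}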

> > void __lock_page(struct page *page)
> > {
> > 	struct wait_queue_head_t *wqh = page_waitqueue(page);
> > 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> >
> > 	spin_lock_irq(&wqh->lock);
> > 	if (!PageWaiters(page))
> > 		SetPageWaiters(page);
> >
> > 	wait.flags |= WQ_FLAG_EXCLUSIVE;
> > 	preempt_disable();
> 
> why?
> 
> > 	do {
> > 		if (list_empty(&wait->task_list))
> > 			__add_wait_queue_tail(wqh, &wait);
> >
> > 		set_current_state(TASK_UNINTERRUPTIBLE);
> >
> > 		if (test_bit(wait.key.bit_nr, wait.key.flags)) {
> > 			spin_unlock_irq(&wqh->lock);
> > 			schedule_preempt_disabled();
> > 			spin_lock_irq(&wqh->lock);
> 
> OK, probably to avoid the preemption before schedule().

Indeed.

> Still can't understand why this makes sense,

Because calling schedule() twice in a row is a bit of wasted effort.
It's just annoying there isn't a more convenient way to express this,
because it's a fairly common thing in wait loops.

> but in this case it would be better
> to do disable/enable under "if (test_bit())" ?

Ah yes.. that code grew and the preempt_disable came about before that
test_bit() block.. :-)
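
So the wait loop becomes something like (sketch):

	do {
		if (list_empty(&wait.wait.task_list))
			__add_wait_queue_tail(wqh, &wait.wait);

		set_current_state(TASK_UNINTERRUPTIBLE);

		if (test_bit(wait.key.bit_nr, wait.key.flags)) {
			/* avoid a pointless preemption between the unlock and schedule() */
			preempt_disable();
			spin_unlock_irq(&wqh->lock);
			schedule_preempt_disabled();
			spin_lock_irq(&wqh->lock);
			preempt_enable();
		}
	} while (!trylock_page(page));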

> Of course, this needs more work for lock_page_killable(), but this
> should be simple.

Yeah, I just wanted to illustrate the point, and cobbling one together
from various wait loops was plenty I thought ;-)

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4
  2014-05-15 13:20             ` Peter Zijlstra
  2014-05-15 13:29               ` Peter Zijlstra
  2014-05-15 15:34               ` Oleg Nesterov
@ 2014-05-15 16:18               ` Mel Gorman
  2 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-15 16:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Thu, May 15, 2014 at 03:20:58PM +0200, Peter Zijlstra wrote:
> On Thu, May 15, 2014 at 11:48:09AM +0100, Mel Gorman wrote:
> 
> > +static inline wait_queue_head_t *clear_page_waiters(struct page *page)
> >  {
> > +	wait_queue_head_t *wqh = NULL;
> > +
> > +	if (!PageWaiters(page))
> > +		return NULL;
> > +
> > +	/*
> > +	 * Prepare to clear PG_waiters if the waitqueue is no longer
> > +	 * active. Note that there is no guarantee that a page with no
> > +	 * waiters will get cleared as there may be unrelated pages
> > +	 * sleeping on the same page wait queue. Accurate detection
> > +	 * would require a counter. In the event of a collision, the
> > +	 * waiter bit will dangle and lookups will be required until
> > +	 * the page is unlocked without collisions. The bit will need to
> > +	 * be cleared before freeing to avoid triggering debug checks.
> > +	 *
> > +	 * Furthermore, this can race with processes about to sleep on
> > +	 * the same page if it adds itself to the waitqueue just after
> > +	 * this check. The timeout in sleep_on_page prevents the race
> > +	 * being a terminal one. In effect, the uncontended and non-race
> > +	 * cases are faster in exchange for occasional worst case of the
> > +	 * timeout saving us.
> > +	 */
> > +	wqh = page_waitqueue(page);
> > +	if (!waitqueue_active(wqh))
> > +		ClearPageWaiters(page);
> > +
> > +	return wqh;
> > +}
> 
> So clear_page_waiters() is I think a bad name for this function, for one
> it doesn't relate to returning a wait_queue_head.
> 

Fair point. find_waiters_queue()?

> Secondly, I think the clear condition is wrong, if I understand the rest
> of the code correctly we'll keep PageWaiters set until the above
> condition, which is not a single waiter on the waitqueue.
> 
> Would it not make much more sense to clear the page when there are no
> more waiters of this page?
> 

The page_waitqueue is hashed and multiple unrelated pages can be waiting
on the same queue. The queue entry is allocated on the stack, so by the
time the queue is walked we have lost track of which page each waiter is
sleeping on. I didn't spot a fast way of detecting whether any of the
waiters are for that particular page, and the expectation is that
collisions on this waitqueue are rare.
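
For reference, page_waitqueue() is just a hash into the zone's wait table,
which is why unrelated pages can end up sharing a queue:

static wait_queue_head_t *page_waitqueue(struct page *page)
{
	const struct zone *zone = page_zone(page);

	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}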

> For the case where there are no waiters at all, this is the same
> condition, but in case there's a hash collision and there's other pages
> waiting, we'll iterate the lot anyway, so we might as well clear it
> there.
> 

> > +/* Returns true if the page is locked */
> > +static inline bool prepare_wait_bit(struct page *page, wait_queue_head_t *wqh,
> > +			wait_queue_t *wq, int state, int bit_nr, bool exclusive)
> > +{
> > +
> > +	/* Set PG_waiters so a racing unlock_page will check the waitqueue */
> > +	if (!PageWaiters(page))
> > +		SetPageWaiters(page);
> > +
> > +	if (exclusive)
> > +		prepare_to_wait_exclusive(wqh, wq, state);
> > +	else
> > +		prepare_to_wait(wqh, wq, state);
> > +	return test_bit(bit_nr, &page->flags);
> >  }
> >  
> >  void wait_on_page_bit(struct page *page, int bit_nr)
> >  {
> > +	wait_queue_head_t *wqh;
> >  	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> >  
> > +	if (!test_bit(bit_nr, &page->flags))
> > +		return;
> > +	wqh = page_waitqueue(page);
> > +
> > +	do {
> > +		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
> > +			sleep_on_page_killable(page);
> > +	} while (test_bit(bit_nr, &page->flags));
> > +	finish_wait(wqh, &wait.wait);
> >  }
> >  EXPORT_SYMBOL(wait_on_page_bit);
> 
> Afaict, after this patch, wait_on_page_bit() is only used by
> wait_on_page_writeback(), and might I ask why that needs the PageWaiter
> set?
> 

To avoid doing a page_waitqueue lookup in end_page_writeback().
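
i.e. the intent is that the tail of end_page_writeback() can do something
like this (sketch, not the exact patch):

	if (!test_clear_page_writeback(page))
		BUG();

	smp_mb__after_atomic();

	/* skip the page_waitqueue() hash lookup unless a waiter flagged itself */
	if (PageWaiters(page))
		wake_up_page(page, PG_writeback);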

> >  int wait_on_page_bit_killable(struct page *page, int bit_nr)
> >  {
> > +	wait_queue_head_t *wqh;
> >  	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> > +	int ret = 0;
> >  
> >  	if (!test_bit(bit_nr, &page->flags))
> >  		return 0;
> > +	wqh = page_waitqueue(page);
> > +
> > +	do {
> > +		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
> > +			ret = sleep_on_page_killable(page);
> > +	} while (!ret && test_bit(bit_nr, &page->flags));
> > +	finish_wait(wqh, &wait.wait);
> >  
> > +	return ret;
> >  }
> 
> The only user of wait_on_page_bit_killable() _was_
> wait_on_page_locked_killable(), but you've just converted that to use
> __wait_on_page_bit_killable().
> 
> So we can scrap this function.
> 

Scrapped

> >  /**
> > @@ -721,6 +785,8 @@ void add_page_wait_queue(struct page *page, wait_queue_t *waiter)
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&q->lock, flags);
> > +	if (!PageWaiters(page))
> > +		SetPageWaiters(page);
> >  	__add_wait_queue(q, waiter);
> >  	spin_unlock_irqrestore(&q->lock, flags);
> >  }
> 
> What does add_page_wait_queue() do and why does it need PageWaiters?
> 

cachefiles uses it for an internal monitor but you're right that this is
unnecessary because those waiters do not go through the paths that
conditionally wake depending on PG_waiters.

Deleted.

> > @@ -740,10 +806,29 @@ EXPORT_SYMBOL_GPL(add_page_wait_queue);
> >   */
> >  void unlock_page(struct page *page)
> >  {
> > +	wait_queue_head_t *wqh = clear_page_waiters(page);
> > +
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> > +
> > +	/*
> > +	 * clear_bit_unlock is not necessary in this case as there is no
> > +	 * need to strongly order the clearing of PG_waiters and PG_locked.
> > +	 * The smp_mb__after_atomic() barrier is still required for RELEASE
> > +	 * semantics as there is no guarantee that a wakeup will take place
> > +	 */
> > +	clear_bit(PG_locked, &page->flags);
> >  	smp_mb__after_atomic();
> 
> If you need RELEASE, use _unlock() because that's exactly what it does.
> 

Done

> > +
> > +	/*
> > +	 * Wake the queue if waiters were detected. Ordinarily this wakeup
> > +	 * would be unconditional to catch races between the lock bit being
> > +	 * set and a new process joining the queue. However, that would
> > +	 * require the waitqueue to be looked up every time. Instead we
> > +	 * optimise for the uncontended and non-race case and recover using
> > +	 * a timeout in sleep_on_page.
> > +	 */
> > +	if (wqh)
> > +		__wake_up_bit(wqh, &page->flags, PG_locked);
> 
> And the only reason we're not clearing PageWaiters under q->lock is to
> skimp on the last contended unlock_page() ?
> 

During implementation I used a new zone lock and then tree_lock to protect
the bit prior to using io_schedule_timeout. This protected the PG_waiters
bit but the granularity of such a lock was troublesome. The problem I
encountered was that the unlock_page() path would not have a reference
to the waitqueue when checking PG_waiters and could hit this race (as it
was structured at the time; the code has changed since):

unlock_page			lock_page
				prepare_to_wait
  if (!PageWaiters)
  	return
				SetPageWaiters
				sleep forever

The order of SetPageWaiters is now different but I didn't revisit using
q->lock to see if that race can be closed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4
  2014-05-15 10:48           ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v4 Mel Gorman
  2014-05-15 13:20             ` Peter Zijlstra
  2014-05-15 15:03             ` Oleg Nesterov
@ 2014-05-15 21:24             ` Andrew Morton
  2014-05-21 12:15               ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5 Mel Gorman
  2 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2014-05-15 21:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Thu, 15 May 2014 11:48:09 +0100 Mel Gorman <mgorman@suse.de> wrote:

> Changelog since v3
> o Correct handling of exclusive waits
> 
> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> 
> This adds a few branches to the fast path but avoids bouncing a dirty
> cache line between CPUs. 32-bit machines always take the slow path but the
> primary motivation for this patch is large machines so I do not think that
> is a concern.
> 
> The test case used to evaluate this is a simple dd of a large file done
> multiple times with the file deleted on each iteration. The size of
> the file is 1/10th of physical memory to avoid dirty page balancing. In the
> async case it is possible that the workload completes without even
> hitting the disk and results will be variable, but it highlights the impact
> of mark_page_accessed for async IO. The sync results are expected to be
> more stable. The exception is tmpfs where the normal case is for the "IO"
> to not hit the disk.
> 
> The test machine was single socket and UMA to avoid any scheduling or
> NUMA artifacts. Throughput and wall times are presented for sync IO; only
> wall times are shown for async as the granularity reported by dd and the
> variability make it unsuitable for comparison. As async results were variable
> due to writeback timings, I'm only reporting the maximum figures. The sync
> results were stable enough to make the mean and stddev uninteresting.
> 
> The performance results are reported based on a run with no profiling.
> Profile data is based on a separate run with oprofile running. The
> kernels being compared are "accessed-v3", which is the patch series up
> to this patch, whereas lockpage-v3 includes this patch.
> 
> async dd
>                                    3.15.0-rc3            3.15.0-rc3
>                                   accessed-v3           lockpage-v3
> ext3   Max      elapsed     11.5900 (  0.00%)     11.0000 (  5.09%)
> ext4   Max      elapsed     13.3400 (  0.00%)     13.4300 ( -0.67%)
> tmpfs  Max      elapsed      0.4900 (  0.00%)      0.4800 (  2.04%)
> btrfs  Max      elapsed     12.7800 (  0.00%)     13.8200 ( -8.14%)
> xfs    Max      elapsed      2.0900 (  0.00%)      2.1100 ( -0.96%)

So ext3 got 5% faster and btrfs got 8% slower?

>
> ...
>

The numbers look pretty marginal from here and the patch is, umm, not a
thing of beauty or simplicity.

I'd be inclined to go find something else to work on, frankly.

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -241,15 +241,22 @@ void delete_from_page_cache(struct page *page)
>  }
>  EXPORT_SYMBOL(delete_from_page_cache);
>  
> -static int sleep_on_page(void *word)
> +static int sleep_on_page(struct page *page)
>  {
> -	io_schedule();
> +	/*
> +	 * A racing unlock can miss that the waitqueue is active and clear
> +	 * PG_waiters again. Only sleep if PG_waiters is still set, and use a
> +	 * timeout so the bit is rechecked as races can still occur.
> +	 */
> +	if (PageWaiters(page))
> +		io_schedule_timeout(HZ);

ew.

>  	return 0;
>  }
>  
> ...
>
> +/* Returns true if the page is locked */

Comment is inaccurate.

> +static inline bool prepare_wait_bit(struct page *page, wait_queue_head_t *wqh,
> +			wait_queue_t *wq, int state, int bit_nr, bool exclusive)
> +{
> +
> +	/* Set PG_waiters so a racing unlock_page will check the waitqueue */
> +	if (!PageWaiters(page))
> +		SetPageWaiters(page);
> +
> +	if (exclusive)
> +		prepare_to_wait_exclusive(wqh, wq, state);
> +	else
> +		prepare_to_wait(wqh, wq, state);
> +	return test_bit(bit_nr, &page->flags);
>  }
>
> ...
>
>  int wait_on_page_bit_killable(struct page *page, int bit_nr)
>  {
> +	wait_queue_head_t *wqh;
>  	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> +	int ret = 0;
>  
>  	if (!test_bit(bit_nr, &page->flags))
>  		return 0;
> +	wqh = page_waitqueue(page);
> +
> +	do {
> +		if (prepare_wait_bit(page, wqh, &wait.wait, TASK_KILLABLE, bit_nr, false))
> +			ret = sleep_on_page_killable(page);
> +	} while (!ret && test_bit(bit_nr, &page->flags));
> +	finish_wait(wqh, &wait.wait);
>  
> -	return __wait_on_bit(page_waitqueue(page), &wait,
> -			     sleep_on_page_killable, TASK_KILLABLE);
> +	return ret;
>  }

Please find a way to test all this nicely when there are signals pending?
  
>  /**
>
> ...
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 0/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb()
  2014-05-14 16:17         ` Peter Zijlstra
@ 2014-05-16 13:51           ` Oleg Nesterov
  2014-05-16 13:51             ` [PATCH 1/1] " Oleg Nesterov
  2014-05-21 19:18             ` [PATCH 0/1] " Andrew Morton
  0 siblings, 2 replies; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-16 13:51 UTC (permalink / raw)
  To: Andrew Morton, Peter Zijlstra, David Howells
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel, Paul McKenney, Linus Torvalds

On 05/14, Peter Zijlstra wrote:
>
> On Wed, May 14, 2014 at 06:11:52PM +0200, Oleg Nesterov wrote:
> >
> > I mean, we do not need mb() before __wake_up(). We need it only because
> > __wake_up_bit() checks waitqueue_active().
> >
> >
> > And at least
> >
> > 	fs/cachefiles/namei.c:cachefiles_delete_object()
> > 	fs/block_dev.c:blkdev_get()
> > 	kernel/signal.c:task_clear_jobctl_trapping()
> > 	security/keys/gc.c:key_garbage_collector()
> >
> > look obviously wrong.
> >
> > I would be happy to send the fix, but do I need to split it per-file?
> > Given that it is trivial, perhaps I can send a single patch?
>
> Since its all the same issue a single patch would be fine I think.

Actually blkdev_get() is fine; it relies on bdev_lock. But bd_prepare_to_claim()
is a good example of abusing bit_waitqueue(). Not only is it itself suboptimal,
it also prevents optimizing the wake_up_bit-like paths. And there are more, say,
inode_sleep_on_writeback(). Plus we have wait_on_atomic_t() which I think should
be generalized or even unified with the regular wait_on_bit(). Perhaps I'll try
to do this later; fortunately the recent patch from Neil greatly reduced the
number of "action" functions.

As for cachefiles_walk_to_object() and key_garbage_collector(), it still seems
to me they need smp_mb__after_clear_bit(), but I'll leave this to David; I am
not comfortable changing code I absolutely do not understand. In particular,
I fail to understand why key_garbage_collector() does smp_mb() before clear_bit().
At least it could be smp_mb__before_clear_bit().

So let me send a trivial patch which only changes task_clear_jobctl_trapping().

Oleg.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH 1/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb()
  2014-05-16 13:51           ` [PATCH 0/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb() Oleg Nesterov
@ 2014-05-16 13:51             ` Oleg Nesterov
  2014-05-21  9:29               ` Peter Zijlstra
  2014-05-21 19:18             ` [PATCH 0/1] " Andrew Morton
  1 sibling, 1 reply; 103+ messages in thread
From: Oleg Nesterov @ 2014-05-16 13:51 UTC (permalink / raw)
  To: Andrew Morton, Peter Zijlstra, David Howells
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel, Paul McKenney, Linus Torvalds

__wake_up_bit() checks waitqueue_active() and thus the caller needs
mb() as wake_up_bit() documents, fix task_clear_jobctl_trapping().
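
For reference, __wake_up_bit() only does the wakeup when the waitqueue looks
non-empty:

void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
{
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);

	if (waitqueue_active(wq))
		__wake_up(wq, TASK_NORMAL, 1, &key);
}

so without a full barrier between clearing the bit and that waitqueue_active()
check, the waker can see a stale empty queue while the sleeper still sees the
bit set, and the wakeup is lost.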

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/signal.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index c2a8542..f4c4119 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -277,6 +277,7 @@ void task_clear_jobctl_trapping(struct task_struct *task)
 {
 	if (unlikely(task->jobctl & JOBCTL_TRAPPING)) {
 		task->jobctl &= ~JOBCTL_TRAPPING;
+		smp_mb();	/* advised by wake_up_bit() */
 		wake_up_bit(&task->jobctl, JOBCTL_TRAPPING_BIT);
 	}
 }
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH] mm: Avoid unnecessary atomic operations during end_page_writeback
  2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
                   ` (18 preceding siblings ...)
  2014-05-13  9:45 ` [PATCH 19/19] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath Mel Gorman
@ 2014-05-19  8:57 ` Mel Gorman
  19 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-19  8:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

If a page is marked for immediate reclaim then it is moved to the tail of
the LRU list. This occurs when the system is under enough memory pressure
for pages under writeback to reach the end of the LRU but we test for
this using atomic operations on every writeback. This patch uses an
optimistic non-atomic test first. It'll miss some pages in rare cases but
the consequences are not severe enough to warrant such a penalty.

While the function does not dominate profiles during a simple dd test the
cost of it is reduced.

73048     0.7428  vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
23740     0.2409  vmlinux-3.15.0-rc5-lessatomic     end_page_writeback

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/filemap.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bec4b9b..dafb06f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -753,8 +753,17 @@ EXPORT_SYMBOL(unlock_page);
  */
 void end_page_writeback(struct page *page)
 {
-	if (TestClearPageReclaim(page))
+	/*
+	 * TestClearPageReclaim could be used here but it is an atomic
+	 * operation and overkill in this particular case. Failing to
+	 * shuffle a page marked for immediate reclaim is too mild to
+	 * justify taking an atomic operation penalty at the end of
+	 * every page writeback.
+	 */
+	if (PageReclaim(page)) {
+		ClearPageReclaim(page);
 		rotate_reclaimable_page(page);
+	}
 
 	if (!test_clear_page_writeback(page))
 		BUG();

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix
  2014-05-13  9:45 ` [PATCH 18/19] mm: Non-atomically mark page accessed during page cache allocation where possible Mel Gorman
  2014-05-13 14:29   ` Theodore Ts'o
@ 2014-05-20 15:49   ` Mel Gorman
  2014-05-20 19:34     ` Andrew Morton
  1 sibling, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-20 15:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

Prabhakar Lad reported the following problem

  I see following issue on DA850 evm,
  git bisect points me to
  commit id: 975c3a671f11279441006a29a19f55ccc15fb320
  ( mm: non-atomically mark page accessed during page cache allocation
  where possible)

  Unable to handle kernel paging request at virtual address 30e03501
  pgd = c68cc000
  [30e03501] *pgd=00000000
  Internal error: Oops: 1 [#1] PREEMPT ARM
  Modules linked in:
  CPU: 0 PID: 1015 Comm: network.sh Not tainted 3.15.0-rc5-00323-g975c3a6 #9
  task: c70c4e00 ti: c73d0000 task.ti: c73d0000
  PC is at init_page_accessed+0xc/0x24
  LR is at shmem_write_begin+0x54/0x60
  pc : [<c0088aa0>]    lr : [<c00923e8>]    psr: 20000013
  sp : c73d1d90  ip : c73d1da0  fp : c73d1d9c
  r10: c73d1dec  r9 : 00000000  r8 : 00000000
  r7 : c73d1e6c  r6 : c694d7bc  r5 : ffffffe4  r4 : c73d1dec
  r3 : c73d0000  r2 : 00000001  r1 : 00000000  r0 : 30e03501
  Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
  Control: 0005317f  Table: c68cc000  DAC: 00000015
  Process network.sh (pid: 1015, stack limit = 0xc73d01c0)

pagep is set but not pointing to anywhere valid as it's an uninitialised
stack variable. This patch is a fix to
mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possible.patch

Reported-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2a7b9d1..0691481 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2459,7 +2459,7 @@ ssize_t generic_perform_write(struct file *file,
 		flags |= AOP_FLAG_UNINTERRUPTIBLE;
 
 	do {
-		struct page *page;
+		struct page *page = NULL;
 		unsigned long offset;	/* Offset into pagecache page */
 		unsigned long bytes;	/* Bytes to write to page */
 		size_t copied;		/* Bytes copied from user */

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix
  2014-05-20 15:49   ` [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix Mel Gorman
@ 2014-05-20 19:34     ` Andrew Morton
  2014-05-21 12:09       ` Mel Gorman
  2014-05-22  5:35       ` Prabhakar Lad
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2014-05-20 19:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Prabhakar Lad

On Tue, 20 May 2014 16:49:00 +0100 Mel Gorman <mgorman@suse.de> wrote:

> Prabhakar Lad reported the following problem
> 
>   I see following issue on DA850 evm,
>   git bisect points me to
>   commit id: 975c3a671f11279441006a29a19f55ccc15fb320
>   ( mm: non-atomically mark page accessed during page cache allocation
>   where possible)
> 
>   Unable to handle kernel paging request at virtual address 30e03501
>   pgd = c68cc000
>   [30e03501] *pgd=00000000
>   Internal error: Oops: 1 [#1] PREEMPT ARM
>   Modules linked in:
>   CPU: 0 PID: 1015 Comm: network.sh Not tainted 3.15.0-rc5-00323-g975c3a6 #9
>   task: c70c4e00 ti: c73d0000 task.ti: c73d0000
>   PC is at init_page_accessed+0xc/0x24
>   LR is at shmem_write_begin+0x54/0x60
>   pc : [<c0088aa0>]    lr : [<c00923e8>]    psr: 20000013
>   sp : c73d1d90  ip : c73d1da0  fp : c73d1d9c
>   r10: c73d1dec  r9 : 00000000  r8 : 00000000
>   r7 : c73d1e6c  r6 : c694d7bc  r5 : ffffffe4  r4 : c73d1dec
>   r3 : c73d0000  r2 : 00000001  r1 : 00000000  r0 : 30e03501
>   Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
>   Control: 0005317f  Table: c68cc000  DAC: 00000015
>   Process network.sh (pid: 1015, stack limit = 0xc73d01c0)
> 
> pagep is set but not pointing to anywhere valid as it's an uninitialised
> stack variable. This patch is a fix to
> mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possible.patch
> 
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2459,7 +2459,7 @@ ssize_t generic_perform_write(struct file *file,
>  		flags |= AOP_FLAG_UNINTERRUPTIBLE;
>  
>  	do {
> -		struct page *page;
> +		struct page *page = NULL;
>  		unsigned long offset;	/* Offset into pagecache page */
>  		unsigned long bytes;	/* Bytes to write to page */
>  		size_t copied;		/* Bytes copied from user */

Well not really.  generic_perform_write() only touches *page if
->write_begin() returned "success", which is reasonable behavior.

I'd say you mucked up shmem_write_begin() - it runs
init_page_accessed() even if shmem_getpage() returned an error.  It
shouldn't be doing that.

This?

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm/shmem.c: don't run init_page_accessed() against an uninitialised pointer

If shmem_getpage() returned an error then it didn't necessarily initialise
*pagep.  So shmem_write_begin() shouldn't be playing with *pagep in this
situation.

Fixes an oops when "mm: non-atomically mark page accessed during page
cache allocation where possible" (quite reasonably) left *pagep
uninitialized.

Reported-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN mm/shmem.c~mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possiblefix-2 mm/shmem.c
--- a/mm/shmem.c~mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possiblefix-2
+++ a/mm/shmem.c
@@ -1376,7 +1376,7 @@ shmem_write_begin(struct file *file, str
 	struct inode *inode = mapping->host;
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 	ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
-	if (*pagep)
+	if (ret == 0 && *pagep)
 		init_page_accessed(*pagep);
 	return ret;
 }
_


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 1/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb()
  2014-05-16 13:51             ` [PATCH 1/1] " Oleg Nesterov
@ 2014-05-21  9:29               ` Peter Zijlstra
  2014-05-21 19:19                 ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-21  9:29 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Andrew Morton, David Howells, Mel Gorman, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Paul McKenney, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

On Fri, May 16, 2014 at 03:51:37PM +0200, Oleg Nesterov wrote:
> __wake_up_bit() checks waitqueue_active() and thus the caller needs
> mb() as wake_up_bit() documents, fix task_clear_jobctl_trapping().
> 
> Signed-off-by: Oleg Nesterov <oleg@redhat.com>

Seeing how you are one of the ptrace maintainers, how do you want this
routed? Does Andrew pick this up, or do I stuff it somewhere?

> ---
>  kernel/signal.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c2a8542..f4c4119 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -277,6 +277,7 @@ void task_clear_jobctl_trapping(struct task_struct *task)
>  {
>  	if (unlikely(task->jobctl & JOBCTL_TRAPPING)) {
>  		task->jobctl &= ~JOBCTL_TRAPPING;
> +		smp_mb();	/* advised by wake_up_bit() */
>  		wake_up_bit(&task->jobctl, JOBCTL_TRAPPING_BIT);
>  	}
>  }
> -- 
> 1.5.5.1
> 
> 

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix
  2014-05-20 19:34     ` Andrew Morton
@ 2014-05-21 12:09       ` Mel Gorman
  2014-05-21 22:11         ` Andrew Morton
  2014-05-22  5:35       ` Prabhakar Lad
  1 sibling, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-21 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Prabhakar Lad

On Tue, May 20, 2014 at 12:34:53PM -0700, Andrew Morton wrote:
> On Tue, 20 May 2014 16:49:00 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > Prabhakar Lad reported the following problem
> > 
> >   I see following issue on DA850 evm,
> >   git bisect points me to
> >   commit id: 975c3a671f11279441006a29a19f55ccc15fb320
> >   ( mm: non-atomically mark page accessed during page cache allocation
> >   where possible)
> > 
> >   Unable to handle kernel paging request at virtual address 30e03501
> >   pgd = c68cc000
> >   [30e03501] *pgd=00000000
> >   Internal error: Oops: 1 [#1] PREEMPT ARM
> >   Modules linked in:
> >   CPU: 0 PID: 1015 Comm: network.sh Not tainted 3.15.0-rc5-00323-g975c3a6 #9
> >   task: c70c4e00 ti: c73d0000 task.ti: c73d0000
> >   PC is at init_page_accessed+0xc/0x24
> >   LR is at shmem_write_begin+0x54/0x60
> >   pc : [<c0088aa0>]    lr : [<c00923e8>]    psr: 20000013
> >   sp : c73d1d90  ip : c73d1da0  fp : c73d1d9c
> >   r10: c73d1dec  r9 : 00000000  r8 : 00000000
> >   r7 : c73d1e6c  r6 : c694d7bc  r5 : ffffffe4  r4 : c73d1dec
> >   r3 : c73d0000  r2 : 00000001  r1 : 00000000  r0 : 30e03501
> >   Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
> >   Control: 0005317f  Table: c68cc000  DAC: 00000015
> >   Process network.sh (pid: 1015, stack limit = 0xc73d01c0)
> > 
> > pagep is set but not pointing to anywhere valid as it's an uninitialised
> > stack variable. This patch is a fix to
> > mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possible.patch
> > 
> > ...
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2459,7 +2459,7 @@ ssize_t generic_perform_write(struct file *file,
> >  		flags |= AOP_FLAG_UNINTERRUPTIBLE;
> >  
> >  	do {
> > -		struct page *page;
> > +		struct page *page = NULL;
> >  		unsigned long offset;	/* Offset into pagecache page */
> >  		unsigned long bytes;	/* Bytes to write to page */
> >  		size_t copied;		/* Bytes copied from user */
> 
> Well not really.  generic_perform_write() only touches *page if
> ->write_begin() returned "success", which is reasonable behavior.
> 
> I'd say you mucked up shmem_write_begin() - it runs
> init_page_accessed() even if shmem_getpage() returned an error.  It
> shouldn't be doing that.
> 
> This?
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/shmem.c: don't run init_page_accessed() against an uninitialised pointer
> 
> If shmem_getpage() returned an error then it didn't necessarily initialise
> *pagep.  So shmem_write_begin() shouldn't be playing with *pagep in this
> situation.
> 
> Fixes an oops when "mm: non-atomically mark page accessed during page
> cache allocation where possible" (quite reasonably) left *pagep
> uninitialized.
> 
> Reported-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-15 21:24             ` Andrew Morton
@ 2014-05-21 12:15               ` Mel Gorman
  2014-05-21 13:02                 ` Peter Zijlstra
  2014-05-21 21:26                 ` Andrew Morton
  0 siblings, 2 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-21 12:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

Andrew had suggested dropping v4 of the patch entirely as the numbers were
marginal and the complexity was high. However, even on a relatively small
machine running simple workloads the overhead of page_waitqueue and wakeup
functions is around 5% of system CPU time. That's quite high for basic
operations so I felt it was worth another shot. The performance figures
are better with this version than they were for v4 and overall the patch
should be more comprehensible.

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are processes waiting on PG_lock and uses it to
avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but the
primary motivation for this patch is large machines so I do not think that
is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of
the file is 1/10th of physical memory to avoid dirty page balancing. In the
async case it is possible that the workload completes without even
hitting the disk and results will be variable, but it highlights the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO; only
wall times are shown for async as the granularity reported by dd and the
variability make it unsuitable for comparison. As async results were variable
due to writeback timings, I'm only reporting the maximum figures. The sync
results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The
kernels being compared are the mmotm baseline, which is the patch series up
to this patch, whereas lockpage-v5 includes this patch.

async dd
                                 3.15.0-rc5            3.15.0-rc5
                                      mmotm           lockpage-v5
btrfs Max      ddtime      0.5863 (  0.00%)      0.5621 (  4.14%)
ext3  Max      ddtime      1.4870 (  0.00%)      1.4609 (  1.76%)
ext4  Max      ddtime      1.0440 (  0.00%)      1.0376 (  0.61%)
tmpfs Max      ddtime      0.3541 (  0.00%)      0.3486 (  1.54%)
xfs   Max      ddtime      0.4995 (  0.00%)      0.4834 (  3.21%)

A separate run with profiles showed this

     samples percentage
ext3  225851    2.3180  vmlinux-3.15.0-rc5-mmotm       test_clear_page_writeback
ext3  106848    1.0966  vmlinux-3.15.0-rc5-mmotm       __wake_up_bit
ext3   71849    0.7374  vmlinux-3.15.0-rc5-mmotm       page_waitqueue
ext3   40319    0.4138  vmlinux-3.15.0-rc5-mmotm       unlock_page
ext3   26243    0.2693  vmlinux-3.15.0-rc5-mmotm       end_page_writeback
ext3  178777    1.7774  vmlinux-3.15.0-rc5-lockpage-v5 test_clear_page_writeback
ext3   67702    0.6731  vmlinux-3.15.0-rc5-lockpage-v5 unlock_page
ext3   22357    0.2223  vmlinux-3.15.0-rc5-lockpage-v5 end_page_writeback
ext3   11131    0.1107  vmlinux-3.15.0-rc5-lockpage-v5 __wake_up_bit
ext3    6360    0.0632  vmlinux-3.15.0-rc5-lockpage-v5 __wake_up_page_bit
ext3    1660    0.0165  vmlinux-3.15.0-rc5-lockpage-v5 page_waitqueue

The profiles show a clear reduction in waitqueue and wakeup functions. The
cost of unlock_page is higher as it's checking PageWaiters but that is offset
by reduced numbers of calls to page_waitqueue and __wake_up_bit. A similar
story is told for each of the filesystems. Note that for workloads that
contend heavily on the page lock, unlock_page may increase in cost as it has
to clear PG_waiters, so while the typical case should be much faster, the
worst-case costs are now higher.

This is also reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                       3.15.0-rc5            3.15.0-rc5
                            mmotm           lockpage-v5
Procs 107M     423.0000 (  0.00%)    409.0000 (  3.31%)
Procs 214M     847.0000 (  0.00%)    823.0000 (  2.83%)
Procs 322M    1296.0000 (  0.00%)   1232.0000 (  4.94%)
Procs 429M    1692.0000 (  0.00%)   1644.0000 (  2.84%)
Procs 536M    2137.0000 (  0.00%)   2057.0000 (  3.74%)
Procs 644M    2542.0000 (  0.00%)   2472.0000 (  2.75%)
Procs 751M    2953.0000 (  0.00%)   2872.0000 (  2.74%)
Procs 859M    3360.0000 (  0.00%)   3310.0000 (  1.49%)
Procs 966M    3770.0000 (  0.00%)   3724.0000 (  1.22%)
Procs 1073M   4220.0000 (  0.00%)   4114.0000 (  2.51%)
Procs 1181M   4638.0000 (  0.00%)   4546.0000 (  1.98%)
Procs 1288M   5038.0000 (  0.00%)   4940.0000 (  1.95%)
Procs 1395M   5481.0000 (  0.00%)   5431.0000 (  0.91%)
Procs 1503M   5940.0000 (  0.00%)   5832.0000 (  1.82%)
Procs 1610M   6316.0000 (  0.00%)   6204.0000 (  1.77%)
Procs 1717M   6749.0000 (  0.00%)   6799.0000 ( -0.74%)
Procs 1825M   7323.0000 (  0.00%)   7082.0000 (  3.29%)
Procs 1932M   7694.0000 (  0.00%)   7452.0000 (  3.15%)
Procs 2040M   8079.0000 (  0.00%)   7927.0000 (  1.88%)
Procs 2147M   8495.0000 (  0.00%)   8360.0000 (  1.59%)

   samples percentage
xfs  78334    1.3089  vmlinux-3.15.0-rc5-mmotm          page_waitqueue
xfs  55910    0.9342  vmlinux-3.15.0-rc5-mmotm          unlock_page
xfs  45120    0.7539  vmlinux-3.15.0-rc5-mmotm          __wake_up_bit
xfs  41414    0.6920  vmlinux-3.15.0-rc5-mmotm          test_clear_page_writeback
xfs   4823    0.0806  vmlinux-3.15.0-rc5-mmotm          end_page_writeback
xfs 100864    1.7063  vmlinux-3.15.0-rc5-lockpage-v5    unlock_page
xfs  52547    0.8889  vmlinux-3.15.0-rc5-lockpage-v5    test_clear_page_writeback
xfs   5031    0.0851  vmlinux-3.15.0-rc5-lockpage-v5    end_page_writeback
xfs   1938    0.0328  vmlinux-3.15.0-rc5-lockpage-v5    __wake_up_bit
xfs      9   1.5e-04  vmlinux-3.15.0-rc5-lockpage-v5    __wake_up_page_bit
xfs      7   1.2e-04  vmlinux-3.15.0-rc5-lockpage-v5    page_waitqueue

[jack@suse.cz: Fix add_page_wait_queue]
[mhocko@suse.cz: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@sgi.com: Do not update struct page unnecessarily]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h | 18 +++++++++
 include/linux/wait.h       |  6 +++
 kernel/sched/wait.c        | 94 +++++++++++++++++++++++++++++++++++++++-------
 mm/filemap.c               | 25 ++++++------
 mm/page_alloc.c            |  1 +
 mm/swap.c                  | 10 +++++
 mm/vmscan.c                |  3 ++
 7 files changed, 132 insertions(+), 25 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7baf0fe..b697e4f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+	TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -509,6 +526,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..5dda464 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
 int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 void wake_up_bit(void *, int);
 void wake_up_atomic_t(atomic_t *);
 int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
@@ -822,6 +827,7 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void finish_wait_page(wait_queue_head_t *q, wait_queue_t *wait, struct page *page);
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..f829e73 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * stops them from bleeding out - it would still allow subsequent
  * loads to move into the critical region).
  */
-void
-prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+static inline void
+__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
+			struct page *page, int state, bool exclusive)
 {
 	unsigned long flags;
 
-	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue(q, wait);
+	if (page && !PageWaiters(page))
+		SetPageWaiters(page);
+	if (list_empty(&wait->task_list)) {
+		if (exclusive) {
+			wait->flags |= WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue_tail(q, wait);
+		} else {
+			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue(q, wait);
+		}
+	}
 	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
+
+void
+prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+{
+	return __prepare_to_wait(q, wait, NULL, state, false);
+}
 EXPORT_SYMBOL(prepare_to_wait);
 
 void
 prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
-	unsigned long flags;
-
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue_tail(q, wait);
-	set_current_state(state);
-	spin_unlock_irqrestore(&q->lock, flags);
+	return __prepare_to_wait(q, wait, NULL, state, true);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
@@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
  * the wait descriptor from the given waitqueue if still
  * queued.
  */
-void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
+			struct page *page)
 {
 	unsigned long flags;
 
@@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 	if (!list_empty_careful(&wait->task_list)) {
 		spin_lock_irqsave(&q->lock, flags);
 		list_del_init(&wait->task_list);
+		if (page && !waitqueue_active(q))
+			ClearPageWaiters(page);
 		spin_unlock_irqrestore(&q->lock, flags);
 	}
 }
+
+void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	return __finish_wait(q, wait, NULL);
+}
 EXPORT_SYMBOL(finish_wait);
 
 /**
@@ -331,6 +347,22 @@ __wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
 	finish_wait(wq, &q->wait);
 	return ret;
 }
+
+int __sched
+__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
+			int (*action)(void *), unsigned mode)
+{
+	int ret = 0;
+
+	do {
+		__prepare_to_wait(wq, &q->wait, page, mode, false);
+		if (test_bit(q->key.bit_nr, q->key.flags))
+			ret = (*action)(q->key.flags);
+	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
+	__finish_wait(wq, &q->wait, page);
+	return ret;
+}
 EXPORT_SYMBOL(__wait_on_bit);
 
 int __sched out_of_line_wait_on_bit(void *word, int bit,
@@ -344,6 +376,27 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
 EXPORT_SYMBOL(out_of_line_wait_on_bit);
 
 int __sched
+__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
+			int (*action)(void *), unsigned mode)
+{
+	do {
+		int ret;
+
+		__prepare_to_wait(wq, &q->wait, page, mode, true);
+		if (!test_bit(q->key.bit_nr, q->key.flags))
+			continue;
+		ret = action(q->key.flags);
+		if (!ret)
+			continue;
+		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
+		return ret;
+	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
+	__finish_wait(wq, &q->wait, page);
+	return 0;
+}
+
+int __sched
 __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 			int (*action)(void *), unsigned mode)
 {
@@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
 }
 EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
 
+void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
+{
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+	unsigned long flags;
+
+	spin_lock_irqsave(&wqh->lock, flags);
+	if (waitqueue_active(wqh))
+		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
+	else
+		ClearPageWaiters(page);
+	spin_unlock_irqrestore(&wqh->lock, flags);
+}
+
 void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
 {
 	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
diff --git a/mm/filemap.c b/mm/filemap.c
index 263cffe..07633a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -682,9 +682,9 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-static inline void wake_up_page(struct page *page, int bit)
+static inline void wake_up_page(struct page *page, int bit_nr)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	__wake_up_page_bit(page_waitqueue(page), page, &page->flags, bit_nr);
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
@@ -692,8 +692,8 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
 	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+		__wait_on_page_bit(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
@@ -704,7 +704,7 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
+	return __wait_on_page_bit(page_waitqueue(page), &wait, page,
 			     sleep_on_page_killable, TASK_KILLABLE);
 }
 
@@ -743,7 +743,8 @@ void unlock_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -769,7 +770,8 @@ void end_page_writeback(struct page *page)
 		BUG();
 
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -806,8 +808,8 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	__wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -815,9 +817,10 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	return __wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_KILLABLE);
 }
+
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1f005..ebb947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6603,6 +6603,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..bf9bd4c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+
+	/* Clear dangling waiters from collisions on page_waitqueue */
+	__ClearPageWaiters(page);
+
 	free_hot_cold_page(page, false);
 }
 
@@ -916,6 +920,12 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
 
+		/*
+		 * Clear waiters bit that may still be set due to a collision
+		 * on page_waitqueue
+		 */
+		__ClearPageWaiters(page);
+
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f85041..e409cbc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,6 +1096,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * waiting on the page lock, because there are no references.
 		 */
 		__clear_page_locked(page);
+		__ClearPageWaiters(page);
 free_it:
 		nr_reclaimed++;
 
@@ -1427,6 +1428,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
@@ -1650,6 +1652,7 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 12:15               ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5 Mel Gorman
@ 2014-05-21 13:02                 ` Peter Zijlstra
  2014-05-21 15:33                   ` Mel Gorman
  2014-05-21 21:26                 ` Andrew Morton
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-21 13:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
> Andrew had suggested dropping v4 of the patch entirely as the numbers were
> marginal and the complexity was high. However, even on a relatively small
> machine running simple workloads the overhead of page_waitqueue and wakeup
> functions is around 5% of system CPU time. That's quite high for basic
> operations so I felt it was worth another shot. The performance figures
> are better with this version than they were for v4 and overall the patch
> should be more comprehensible.

Simpler patch and better performance, yay!

> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.

The patch seems to also explicitly use it for PG_writeback, yet no
mention of that here.

> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 0ffa20a..f829e73 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
>   * stops them from bleeding out - it would still allow subsequent
>   * loads to move into the critical region).
>   */
> +static inline void

Make that __always_inline, that way we're guaranteed to optimize the
build time constant .page=NULL cases.
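
To make that concrete, a minimal user-space sketch (hypothetical names,
nothing taken from the patch itself) of why the build-time-constant NULL
case matters: once the helper is forced inline, the compiler sees the
constant and the page branch becomes dead code.

#include <stdbool.h>
#include <stdio.h>

struct page { unsigned long flags; };

/* stand-in for SetPageWaiters(), purely for illustration */
static inline void set_waiters(struct page *page)
{
	page->flags |= 1UL;
}

/* forced inline: a compile-time NULL argument makes the branch dead code */
static inline __attribute__((always_inline)) void
prepare(struct page *page, bool exclusive)
{
	if (page && !(page->flags & 1UL))
		set_waiters(page);
	printf("queued %s waiter\n", exclusive ? "exclusive" : "non-exclusive");
}

int main(void)
{
	struct page p = { 0 };

	prepare(NULL, false);	/* compiles down to just the printf */
	prepare(&p, true);	/* keeps the flag test and update */
	return 0;
}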

> +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page, int state, bool exclusive)
>  {
>  	unsigned long flags;
>  
> +	if (page && !PageWaiters(page))
> +		SetPageWaiters(page);
> +	if (list_empty(&wait->task_list)) {
> +		if (exclusive) {
> +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue_tail(q, wait);
> +		} else {

I'm fairly sure we've just initialized the wait thing to 0, so clearing
the bit would be superfluous.

> +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue(q, wait);
> +		}
> +	}
>  	set_current_state(state);
>  	spin_unlock_irqrestore(&q->lock, flags);
>  }
> +
> +void
> +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> +{
> +	return __prepare_to_wait(q, wait, NULL, state, false);
> +}
>  EXPORT_SYMBOL(prepare_to_wait);
>  
>  void
>  prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
>  {
> +	return __prepare_to_wait(q, wait, NULL, state, true);
>  }
>  EXPORT_SYMBOL(prepare_to_wait_exclusive);
>  
> @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
>   * the wait descriptor from the given waitqueue if still
>   * queued.
>   */
> +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page)
>  {

Same thing, make that __always_inline.

>  	unsigned long flags;
>  
> @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
>  	if (!list_empty_careful(&wait->task_list)) {
>  		spin_lock_irqsave(&q->lock, flags);
>  		list_del_init(&wait->task_list);
> +		if (page && !waitqueue_active(q))
> +			ClearPageWaiters(page);
>  		spin_unlock_irqrestore(&q->lock, flags);
>  	}
>  }
> +
> +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> +	return __finish_wait(q, wait, NULL);
> +}
>  EXPORT_SYMBOL(finish_wait);
>  
>  /**

> @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
>  }
>  EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
>  
> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> +{
> +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&wqh->lock, flags);
> +	if (waitqueue_active(wqh))
> +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> +	else
> +		ClearPageWaiters(page);
> +	spin_unlock_irqrestore(&wqh->lock, flags);
> +}

Seeing how word is always going to be &page->flags, might it make sense
to remove that argument?


Anyway, looks good in principle. Oleg?

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 13:02                 ` Peter Zijlstra
@ 2014-05-21 15:33                   ` Mel Gorman
  2014-05-21 16:08                     ` Peter Zijlstra
  0 siblings, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-21 15:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 21, 2014 at 03:02:23PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 01:15:01PM +0100, Mel Gorman wrote:
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads the overhead of page_waitqueue and wakeup
> > functions is around 5% of system CPU time. That's quite high for basic
> > operations so I felt it was worth another shot. The performance figures
> > are better with this version than they were for v4 and overall the patch
> > should be more comprehensible.
> 
> Simpler patch and better performance, yay!
> 
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> 
> The patch seems to also explicitly use it for PG_writeback, yet no
> mention of that here.
> 

I'll add a note.

> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 0ffa20a..f829e73 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > +static inline void
> 
> Make that __always_inline, that way we're guaranteed to optimize the
> build time constant .page=NULL cases.
> 

Done.

> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
> >  {
> >  	unsigned long flags;
> >  
> > +	if (page && !PageWaiters(page))
> > +		SetPageWaiters(page);
> > +	if (list_empty(&wait->task_list)) {
> > +		if (exclusive) {
> > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue_tail(q, wait);
> > +		} else {
> 
> I'm fairly sure we've just initialized the wait thing to 0, so clearing
> the bit would be superfluous.
> 

I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be
superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT()
then it's redundant. If it's a wait_queue_t that is being reused and
sometimes used for exclusive waits and other times for non-exclusive
waits then it's required. The API allows this to happen so I see no harm
in clearing the flag like the old code did. Am I missing your point?
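
To spell out the reuse case with a toy user-space model (hypothetical
names, not the real wait_queue_t): DEFINE_WAIT() starts the flags at zero,
but a waiter that is reused after an exclusive wait would otherwise carry
WQ_FLAG_EXCLUSIVE into its next, non-exclusive use.

#include <stdio.h>

#define WQ_FLAG_EXCLUSIVE	0x01

struct waiter { unsigned int flags; };

/* mimics DEFINE_WAIT(): unnamed members are zero-initialised, so the
 * exclusive flag starts off clear on a freshly defined on-stack waiter */
#define DEFINE_WAITER(name)	struct waiter name = { .flags = 0 }

static void prepare(struct waiter *w, int exclusive)
{
	if (exclusive)
		w->flags |= WQ_FLAG_EXCLUSIVE;
	else
		w->flags &= ~WQ_FLAG_EXCLUSIVE;	/* only matters when reused */
}

int main(void)
{
	DEFINE_WAITER(w);

	prepare(&w, 1);		/* exclusive wait */
	prepare(&w, 0);		/* reused non-exclusively */
	printf("flags after reuse: %#x\n", w.flags);	/* 0, thanks to the clear */
	return 0;
}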

> > +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue(q, wait);
> > +		}
> > +	}
> >  	set_current_state(state);
> >  	spin_unlock_irqrestore(&q->lock, flags);
> >  }
> > +
> > +void
> > +prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +{
> > +	return __prepare_to_wait(q, wait, NULL, state, false);
> > +}
> >  EXPORT_SYMBOL(prepare_to_wait);
> >  
> >  void
> >  prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
> >  {
> > +	return __prepare_to_wait(q, wait, NULL, state, true);
> >  }
> >  EXPORT_SYMBOL(prepare_to_wait_exclusive);
> >  
> > @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
> >   * the wait descriptor from the given waitqueue if still
> >   * queued.
> >   */
> > +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page)
> >  {
> 
> Same thing, make that __always_inline.
> 

Done.

> >  	unsigned long flags;
> >  
> > @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> >  	if (!list_empty_careful(&wait->task_list)) {
> >  		spin_lock_irqsave(&q->lock, flags);
> >  		list_del_init(&wait->task_list);
> > +		if (page && !waitqueue_active(q))
> > +			ClearPageWaiters(page);
> >  		spin_unlock_irqrestore(&q->lock, flags);
> >  	}
> >  }
> > +
> > +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > +	return __finish_wait(q, wait, NULL);
> > +}
> >  EXPORT_SYMBOL(finish_wait);
> >  
> >  /**
> 
> > @@ -374,6 +427,19 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
> >  }
> >  EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
> >  
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > +{
> > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&wqh->lock, flags);
> > +	if (waitqueue_active(wqh))
> > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +	else
> > +		ClearPageWaiters(page);
> > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
> 
> Seeing how word is always going to be &page->flags, might it make sense
> to remove that argument?
> 

The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses
wake_bit_function() as a wakeup function and that thing consumes both the
page->flags and the bit number it's interested in. This is used for both
PG_writeback and PG_locked so assumptions cannot really be made about
the value.
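
Roughly, the wake-time filter amounts to the following; this is a
user-space paraphrase of the idea behind wake_bit_function(), not the
kernel code itself.

#include <stdbool.h>
#include <stdio.h>

struct wait_bit_key { unsigned long *word; int bit_nr; };

static bool bit_is_set(const unsigned long *word, int bit)
{
	return (*word >> bit) & 1UL;
}

/*
 * A waiter is only woken when the wake-up key names the same flags word
 * and the same bit, and that bit is now clear.  PG_locked and PG_writeback
 * waiters both key off &page->flags; only bit_nr differs.
 */
static bool waiter_matches(const struct wait_bit_key *waiter,
			   const struct wait_bit_key *wakeup)
{
	return waiter->word == wakeup->word &&
	       waiter->bit_nr == wakeup->bit_nr &&
	       !bit_is_set(wakeup->word, wakeup->bit_nr);
}

int main(void)
{
	unsigned long flags = 0;	/* stand-in for page->flags */
	struct wait_bit_key locked_waiter  = { &flags, 0 };	/* "PG_locked"    */
	struct wait_bit_key writeback_wake = { &flags, 1 };	/* "PG_writeback" */

	printf("woken by a writeback wake-up: %d\n",
	       waiter_matches(&locked_waiter, &writeback_wake));	/* 0 */
	return 0;
}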

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 15:33                   ` Mel Gorman
@ 2014-05-21 16:08                     ` Peter Zijlstra
  0 siblings, 0 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-21 16:08 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 21, 2014 at 04:33:57PM +0100, Mel Gorman wrote:
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > +			struct page *page, int state, bool exclusive)
> > >  {
> > >  	unsigned long flags;
> > >  
> > > +	if (page && !PageWaiters(page))
> > > +		SetPageWaiters(page);
> > > +	if (list_empty(&wait->task_list)) {
> > > +		if (exclusive) {
> > > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > > +			__add_wait_queue_tail(q, wait);
> > > +		} else {
> > 
> > I'm fairly sure we've just initialized the wait thing to 0, so clearing
> > the bit would be superfluous.
> > 
> 
> I assume you mean the clearing of WQ_FLAG_EXCLUSIVE. It may or may not be
> superfluous. If it's an on-stack wait_queue_t initialised with DEFINE_WAIT()
> then it's redundant. If it's a wait_queue_t that is being reused and
> sometimes used for exclusive waits and other times for non-exclusive
> waits then it's required. The API allows this to happen so I see no harm
> in clearing the flag like the old code did. Am I missing your point?

Yeah, I'm not aware of any other users except the on-stack kind, but
you're right.

Maybe we should stick an object_is_on_stack() test in there to see if
anything falls out, something for a rainy afternoon perhaps..
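
Presumably little more than a WARN_ON_ONCE(!object_is_on_stack(wait)) in
__prepare_to_wait(). For reference, a rough user-space approximation of
what that helper checks (glibc-specific, build with -pthread):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>

/* crude stand-in for the kernel's object_is_on_stack(): does the address
 * fall inside the current thread's stack? */
static int object_is_on_stack(const void *obj)
{
	pthread_attr_t attr;
	void *stack;
	size_t size;

	pthread_getattr_np(pthread_self(), &attr);
	pthread_attr_getstack(&attr, &stack, &size);
	pthread_attr_destroy(&attr);
	return (uintptr_t)obj >= (uintptr_t)stack &&
	       (uintptr_t)obj <  (uintptr_t)stack + size;
}

static long global_waiter;	/* a long-lived, non-stack "waiter" */

int main(void)
{
	long stack_waiter;

	printf("on-stack waiter: %d, global waiter: %d\n",
	       object_is_on_stack(&stack_waiter),
	       object_is_on_stack(&global_waiter));	/* 1, 0 */
	return 0;
}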

> > > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > > +{
> > > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&wqh->lock, flags);
> > > +	if (waitqueue_active(wqh))
> > > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > > +	else
> > > +		ClearPageWaiters(page);
> > > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > > +}
> > 
> > Seeing how word is always going to be &page->flags, might it make sense
> > to remove that argument?
> > 
> 
> The wait_queue was defined on-stack with DEFINE_WAIT_BIT which uses
> wake_bit_function() as a wakeup function and that thing consumes both the
> page->flags and the bit number it's interested in. This is used for both
> PG_writeback and PG_locked so assumptions cannot really be made about
> the value.

Well, both PG_flags come from the same &page->flags word, right? But
yeah, if we ever decide to grow the page frame with another flags word
we'd be in trouble :-)

In any case I don't feel too strongly about either of these points.

* Re: [PATCH 0/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb()
  2014-05-16 13:51           ` [PATCH 0/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb() Oleg Nesterov
  2014-05-16 13:51             ` [PATCH 1/1] " Oleg Nesterov
@ 2014-05-21 19:18             ` Andrew Morton
  1 sibling, 0 replies; 103+ messages in thread
From: Andrew Morton @ 2014-05-21 19:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, David Howells, Mel Gorman, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Paul McKenney, Linus Torvalds

On Fri, 16 May 2014 15:51:16 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> On 05/14, Peter Zijlstra wrote:
> >
> > On Wed, May 14, 2014 at 06:11:52PM +0200, Oleg Nesterov wrote:
> > >
> > > I mean, we do not need mb() before __wake_up(). We need it only because
> > > __wake_up_bit() checks waitqueue_active().
> > >
> > >
> > > And at least
> > >
> > > 	fs/cachefiles/namei.c:cachefiles_delete_object()
> > > 	fs/block_dev.c:blkdev_get()
> > > 	kernel/signal.c:task_clear_jobctl_trapping()
> > > 	security/keys/gc.c:key_garbage_collector()
> > >
> > > look obviously wrong.
> > >
> > > I would be happy to send the fix, but do I need to split it per-file?
> > > Given that it is trivial, perhaps I can send a single patch?
> >
> > Since its all the same issue a single patch would be fine I think.
> 
> Actually blkdev_get() is fine, it relies on bdev_lock. But bd_prepare_to_claim()
> is a good example of abusing bit_waitqueue(). Not only is it itself suboptimal,
> it also doesn't allow optimizing wake_up_bit-like paths. And there are more, say,
> inode_sleep_on_writeback(). Plus we have wait_on_atomic_t() which I think should
> be generalized or even unified with the regular wait_on_bit(). Perhaps I'll try
> to do this later, fortunately the recent patch from Neil greatly reduced the
> number of "action" functions.
> 
> As for cachefiles_walk_to_object() and key_garbage_collector(), it still seems
> to me they need smp_mb__after_clear_bit() but I'll leave this to David, I am
> not comfortable changing code I absolutely do not understand. In particular,
> I fail to understand why key_garbage_collector() does smp_mb() before clear_bit().
> At least it could be smp_mb__before_clear_bit().

This is all quite convincing evidence that these interfaces are too
tricky for regular kernel developers to use. 

Can we fix them?

One way would be to make the interfaces safe to use and provide
lower-level no-barrier interfaces for use by hot-path code where the
author knows what he/she is doing.  And there are probably other ways.
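
One possible shape for such a split, purely as an untested sketch with
made-up names (the safe variant pays for the barrier every time, the raw
one is for hot paths whose callers handle the ordering themselves):

#include <linux/atomic.h>
#include <linux/wait.h>

/*
 * Safe by default: the caller just cleared a bit with clear_bit() and
 * wants any sleeper woken; the barrier that orders the clear against the
 * waitqueue_active() check lives here rather than at every call site.
 */
static inline void wake_up_bit_safe(void *word, int bit)
{
	smp_mb__after_atomic();
	wake_up_bit(word, bit);
}

/*
 * Hot-path variant: the caller guarantees the ordering itself, so no
 * extra barrier is issued here.
 */
static inline void wake_up_bit_nobarrier(void *word, int bit)
{
	wake_up_bit(word, bit);
}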

* Re: [PATCH 1/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb()
  2014-05-21  9:29               ` Peter Zijlstra
@ 2014-05-21 19:19                 ` Andrew Morton
  0 siblings, 0 replies; 103+ messages in thread
From: Andrew Morton @ 2014-05-21 19:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, David Howells, Mel Gorman, Johannes Weiner,
	Vlastimil Babka, Jan Kara, Michal Hocko, Hugh Dickins,
	Dave Hansen, Linux Kernel, Linux-MM, Linux-FSDevel,
	Paul McKenney, Linus Torvalds

On Wed, 21 May 2014 11:29:32 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, May 16, 2014 at 03:51:37PM +0200, Oleg Nesterov wrote:
> > __wake_up_bit() checks waitqueue_active() and thus the caller needs
> > mb() as wake_up_bit() documents, fix task_clear_jobctl_trapping().
> > 
> > Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> 
> Seeing how you are one of the ptrace maintainers, how do you want this
> routed? Does Andrew pick this up, do I stuff it somewhere?

Thanks, I grabbed it.  ptrace has been pretty quiet lately.

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 12:15               ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5 Mel Gorman
  2014-05-21 13:02                 ` Peter Zijlstra
@ 2014-05-21 21:26                 ` Andrew Morton
  2014-05-21 21:33                   ` Peter Zijlstra
  2014-05-21 23:35                   ` Mel Gorman
  1 sibling, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2014-05-21 21:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman <mgorman@suse.de> wrote:

> Andrew had suggested dropping v4 of the patch entirely as the numbers were
> marginal and the complexity was high. However, even on a relatively small
> machine running simple workloads the overhead of page_waitqueue and wakeup
> functions is around 5% of system CPU time. That's quite high for basic
> operations so I felt it was worth another shot. The performance figures
> are better with this version than they were for v4 and overall the patch
> should be more comprehensible.
> 
> Changelog since v4
> o Remove dependency on io_schedule_timeout
> o Push waiting logic down into waitqueue
> 
> This patch introduces a new page flag for 64-bit capable machines,
> PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> 
> This adds a few branches to the fast path but avoids bouncing a dirty
> cache line between CPUs. 32-bit machines always take the slow path but the
> primary motivation for this patch is large machines so I do not think that
> is a concern.
> 
> The test case used to evaluate this is a simple dd of a large file done
> multiple times with the file deleted on each iteration. The size of
> the file is 1/10th physical memory to avoid dirty page balancing. In the
> async case it will be possible that the workload completes without even
> hitting the disk and will have variable results but highlight the impact
> of mark_page_accessed for async IO. The sync results are expected to be
> more stable. The exception is tmpfs where the normal case is for the "IO"
> to not hit the disk.
> 
> The test machine was single socket and UMA to avoid any scheduling or
> NUMA artifacts. Throughput and wall times are presented for sync IO, only
> wall times are shown for async as the granularity reported by dd and the
> variability is unsuitable for comparison. As async results were variable
> due to writeback timings, I'm only reporting the maximum figures. The sync
> results were stable enough to make the mean and stddev uninteresting.
> 
> The performance results are reported based on a run with no profiling.
> Profile data is based on a separate run with oprofile running. The
> kernels being compared are "accessed-v2" which is the patch series up
> to this patch, whereas lockpage-v2 includes this patch.
> 
> ...
>
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
>  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
>  void __wake_up_bit(wait_queue_head_t *, void *, int);
> +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);

You're going to need to forward-declare struct page in wait.h.  The
good thing about this is that less people will notice that we've gone
and mentioned struct page in wait.h :(

>  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> 
> ...
>
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
>   * stops them from bleeding out - it would still allow subsequent
>   * loads to move into the critical region).
>   */
> -void
> -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> +static inline void
> +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page, int state, bool exclusive)

Putting MM stuff into core waitqueue code is rather bad.  I really
don't know how I'm going to explain this to my family.

>  {
>  	unsigned long flags;
>  
> -	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
>  	spin_lock_irqsave(&q->lock, flags);
> -	if (list_empty(&wait->task_list))
> -		__add_wait_queue(q, wait);
> +	if (page && !PageWaiters(page))
> +		SetPageWaiters(page);

And this isn't racy because we're assuming that all users of `page' are
using the same waitqueue.  ie, assuming all callers use
page_waitqueue()?   Subtle, unobvious, worth documenting.

> +	if (list_empty(&wait->task_list)) {
> +		if (exclusive) {
> +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue_tail(q, wait);
> +		} else {
> +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> +			__add_wait_queue(q, wait);
> +		}
> +	}
>  	set_current_state(state);
>  	spin_unlock_irqrestore(&q->lock, flags);
>  }
> 
> ...
>
> @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
>   * the wait descriptor from the given waitqueue if still
>   * queued.
>   */
> -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> +			struct page *page)

Thusly does kerneldoc bitrot.

>  {
>  	unsigned long flags;
>  
> @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
>  	if (!list_empty_careful(&wait->task_list)) {
>  		spin_lock_irqsave(&q->lock, flags);
>  		list_del_init(&wait->task_list);
> +		if (page && !waitqueue_active(q))
> +			ClearPageWaiters(page);

And again, the assumption that all users of this page use the same
waitqueue avoids the races?

>  		spin_unlock_irqrestore(&q->lock, flags);
>  	}
>  }
> +
> +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> +{
> +	return __finish_wait(q, wait, NULL);
> +}
>  EXPORT_SYMBOL(finish_wait);
>  
>  /**
> @@ -331,6 +347,22 @@ __wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
>  	finish_wait(wq, &q->wait);
>  	return ret;
>  }
> +
> +int __sched
> +__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
> +			struct page *page,
> +			int (*action)(void *), unsigned mode)

Comment over __wait_on_bit needs updating.

> +{
> +	int ret = 0;
> +
> +	do {
> +		__prepare_to_wait(wq, &q->wait, page, mode, false);
> +		if (test_bit(q->key.bit_nr, q->key.flags))
> +			ret = (*action)(q->key.flags);
> +	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
> +	__finish_wait(wq, &q->wait, page);
> +	return ret;
> +}

__wait_on_bit() can now become a wrapper which calls this with page==NULL?

>  EXPORT_SYMBOL(__wait_on_bit);

This export is now misplaced.

>  int __sched out_of_line_wait_on_bit(void *word, int bit,
> @@ -344,6 +376,27 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
>  EXPORT_SYMBOL(out_of_line_wait_on_bit);
>  
>  int __sched
> +__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> +			struct page *page,
> +			int (*action)(void *), unsigned mode)
> +{
> +	do {
> +		int ret;
> +
> +		__prepare_to_wait(wq, &q->wait, page, mode, true);
> +		if (!test_bit(q->key.bit_nr, q->key.flags))
> +			continue;
> +		ret = action(q->key.flags);
> +		if (!ret)
> +			continue;
> +		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
> +		return ret;
> +	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> +	__finish_wait(wq, &q->wait, page);
> +	return 0;
> +}

You are in a maze of twisty little functions, all alike.  Perhaps some
rudimentary documentation here?  Like what on earth does
__wait_on_page_bit_lock() actually do?   And `mode'.


> +int __sched
>  __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
>  			int (*action)(void *), unsigned mode)

Perhaps __wait_on_bit_lock() can become a wrapper around
__wait_on_page_bit_lock().

> 
> ...
>
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
>  static void __put_single_page(struct page *page)
>  {
>  	__page_cache_release(page);
> +
> +	/* Clear dangling waiters from collisions on page_waitqueue */
> +	__ClearPageWaiters(page);

What's this collisions thing?

>  	free_hot_cold_page(page, false);
>  }
>  
> 
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1096,6 +1096,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		 * waiting on the page lock, because there are no references.
>  		 */
>  		__clear_page_locked(page);
> +		__ClearPageWaiters(page);

We're freeing the page - if someone is still waiting on it then we have
a huge bug?  It's the mysterious collision thing again I hope?

>  free_it:
>  		nr_reclaimed++;
>  
> 
> ...
>


* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 21:26                 ` Andrew Morton
@ 2014-05-21 21:33                   ` Peter Zijlstra
  2014-05-21 21:50                     ` Andrew Morton
  2014-05-21 23:35                   ` Mel Gorman
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-21 21:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
> 
> Putting MM stuff into core waitqueue code is rather bad.  I really
> don't know how I'm going to explain this to my family.

Right, so we could avoid all that and make the functions in mm/filemap.c
rather large and opencode a bunch of wait.c stuff.

Which is pretty much what I initially pseudo proposed.

> > +		__ClearPageWaiters(page);
> 
> We're freeing the page - if someone is still waiting on it then we have
> a huge bug?  It's the mysterious collision thing again I hope?

Yeah, so we only clear that bit when at 'unlock' we find there are no
more pending waiters, so if the last unlock still had a waiter, we'll
leave the bit set. So it's entirely reasonable to still have it set when
we free a page etc.

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 21:33                   ` Peter Zijlstra
@ 2014-05-21 21:50                     ` Andrew Morton
  2014-05-22  0:07                       ` Mel Gorman
  2014-05-22  6:45                       ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5 Peter Zijlstra
  0 siblings, 2 replies; 103+ messages in thread
From: Andrew Morton @ 2014-05-21 21:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > +static inline void
> > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > +			struct page *page, int state, bool exclusive)
> > 
> > Putting MM stuff into core waitqueue code is rather bad.  I really
> > don't know how I'm going to explain this to my family.
> 
> Right, so we could avoid all that and make the functions in mm/filemap.c
> rather large and opencode a bunch of wait.c stuff.
> 

The world won't end if we do it Mel's way and it's probably the most
efficient.  But ugh.  This stuff does raise the "it had better be a
useful patch" bar.

> Which is pretty much what I initially pseudo proposed.

Alternative solution is not to merge the patch ;)

> > > +		__ClearPageWaiters(page);
> > 
> > We're freeing the page - if someone is still waiting on it then we have
> > a huge bug?  It's the mysterious collision thing again I hope?
> 
> Yeah, so we only clear that bit when at 'unlock' we find there are no
> more pending waiters, so if the last unlock still had a waiter, we'll
> leave the bit set.

Confused.  If the last unlock had a waiter, that waiter will get woken
up so there are no waiters any more, so the last unlock clears the flag.

um, how do we determine that there are no more waiters?  By looking at
the waitqueue.  But that waitqueue is hashed, so it may contain waiters
for other pages so we're screwed?  But we could just go and wake up the
other-page waiters anyway and still clear PG_waiters?

um2, we're using exclusive waitqueues so we can't (or don't) wake all
waiters, so we're screwed again?

(This process is proving to be a hard way of writing Mel's changelog btw).

If I'm still on track here, what happens if we switch to wake-all so we
can avoid the dangling flag?  I doubt if there are many collisions on
that hash table?

If there *are* a lot of collisions, I bet it's because a great pile of
threads are all waiting on the same page.  If they're trying to lock
that page then wake-all is bad.  But if they're just waiting for IO
completion (probable) then it's OK.

I'll stop now.

* Re: [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix
  2014-05-21 12:09       ` Mel Gorman
@ 2014-05-21 22:11         ` Andrew Morton
  2014-05-22  0:07           ` Mel Gorman
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2014-05-21 22:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Prabhakar Lad

On Wed, 21 May 2014 13:09:16 +0100 Mel Gorman <mgorman@suse.de> wrote:

> > From: Andrew Morton <akpm@linux-foundation.org>
> > Subject: mm/shmem.c: don't run init_page_accessed() against an uninitialised pointer
> > 
> > If shmem_getpage() returned an error then it didn't necessarily initialise
> > *pagep.  So shmem_write_begin() shouldn't be playing with *pagep in this
> > situation.
> > 
> > Fixes an oops when "mm: non-atomically mark page accessed during page
> > cache allocation where possible" (quite reasonably) left *pagep
> > uninitialized.
> > 
> > Reported-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Jan Kara <jack@suse.cz>
> > Cc: Michal Hocko <mhocko@suse.cz>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> 
> Acked-by: Mel Gorman <mgorman@suse.de>

What to do with
http://ozlabs.org/~akpm/mmots/broken-out/mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possible-fix.patch?

We shouldn't need it any more.  otoh it's pretty harmless.  otooh it
will hide bugs such as this one.


* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 21:26                 ` Andrew Morton
  2014-05-21 21:33                   ` Peter Zijlstra
@ 2014-05-21 23:35                   ` Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-21 23:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads the overhead of page_waitqueue and wakeup
> > functions is around 5% of system CPU time. That's quite high for basic
> > operations so I felt it was worth another shot. The performance figures
> > are better with this version than they were for v4 and overall the patch
> > should be more comprehensible.
> > 
> > Changelog since v4
> > o Remove dependency on io_schedule_timeout
> > o Push waiting logic down into waitqueue
> > 
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_lock and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> > 
> > This adds a few branches to the fast path but avoids bouncing a dirty
> > cache line between CPUs. 32-bit machines always take the slow path but the
> > primary motivation for this patch is large machines so I do not think that
> > is a concern.
> > 
> > The test case used to evaluate this is a simple dd of a large file done
> > multiple times with the file deleted on each iteration. The size of
> > the file is 1/10th physical memory to avoid dirty page balancing. In the
> > async case it will be possible that the workload completes without even
> > hitting the disk and will have variable results but highlight the impact
> > of mark_page_accessed for async IO. The sync results are expected to be
> > more stable. The exception is tmpfs where the normal case is for the "IO"
> > to not hit the disk.
> > 
> > The test machine was single socket and UMA to avoid any scheduling or
> > NUMA artifacts. Throughput and wall times are presented for sync IO, only
> > wall times are shown for async as the granularity reported by dd and the
> > variability is unsuitable for comparison. As async results were variable
> > due to writeback timings, I'm only reporting the maximum figures. The sync
> > results were stable enough to make the mean and stddev uninteresting.
> > 
> > The performance results are reported based on a run with no profiling.
> > Profile data is based on a separate run with oprofile running. The
> > kernels being compared are "accessed-v2" which is the patch series up
> > to this patch, whereas lockpage-v2 includes this patch.
> > 
> > ...
> >
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
> >  void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
> >  void __wake_up_bit(wait_queue_head_t *, void *, int);
> > +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
> 
> You're going to need to forward-declare struct page in wait.h.  The
> good thing about this is that less people will notice that we've gone
> and mentioned struct page in wait.h :(
> 

Will add the forward-declare.

> >  int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> > +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> > 
> > ...
> >
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
> >   * stops them from bleeding out - it would still allow subsequent
> >   * loads to move into the critical region).
> >   */
> > -void
> > -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page, int state, bool exclusive)
> 
> Putting MM stuff into core waitqueue code is rather bad.  I really
> don't know how I'm going to explain this to my family.
> 

The alternative to updating wait.h and wait.c was open-coding the waitqueue
modifications in filemap.c, but that is just as ugly. The wait queue stuff
is complex and there was motivation to keep it in one place even if we
are special casing struct page handling.

FWIW, I cannot explain anything I do in work to my family. It gets blank
looks no matter what.


> >  {
> >  	unsigned long flags;
> >  
> > -	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> >  	spin_lock_irqsave(&q->lock, flags);
> > -	if (list_empty(&wait->task_list))
> > -		__add_wait_queue(q, wait);
> > +	if (page && !PageWaiters(page))
> > +		SetPageWaiters(page);
> 
> And this isn't racy because we're assuming that all users of `page' are
> using the same waitqueue.  ie, assuming all callers use
> page_waitqueue()?   Subtle, unobvious, worth documenting.
> 

All users of the page will get the same waitqueue. page_waitqueue is hashed
on the pointer value.
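
In toy form (made-up constant and table size; see hash_ptr() and
zone->wait_table in the patch for the real lookup):

#include <stdio.h>
#include <stdint.h>

#define WAIT_TABLE_BITS	8	/* stand-in for zone->wait_table_bits */

/* toy pointer hash: the same page pointer always picks the same bucket,
 * while two different pages may happen to share one (a collision) */
static unsigned int toy_hash_ptr(const void *ptr, unsigned int bits)
{
	uint64_t v = (uintptr_t)ptr * 0x9E3779B97F4A7C15ULL;

	return (unsigned int)(v >> (64 - bits));
}

int main(void)
{
	int page_a, page_b;	/* stand-ins for two struct pages */

	printf("page_a -> bucket %u (and again: %u)\n",
	       toy_hash_ptr(&page_a, WAIT_TABLE_BITS),
	       toy_hash_ptr(&page_a, WAIT_TABLE_BITS));
	printf("page_b -> bucket %u\n",
	       toy_hash_ptr(&page_b, WAIT_TABLE_BITS));
	return 0;
}

So every SetPageWaiters/ClearPageWaiters for a given page happens under
the same bucket's q->lock, which is what keeps it from being racy.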

> > +	if (list_empty(&wait->task_list)) {
> > +		if (exclusive) {
> > +			wait->flags |= WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue_tail(q, wait);
> > +		} else {
> > +			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > +			__add_wait_queue(q, wait);
> > +		}
> > +	}
> >  	set_current_state(state);
> >  	spin_unlock_irqrestore(&q->lock, flags);
> >  }
> > 
> > ...
> >
> > @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
> >   * the wait descriptor from the given waitqueue if still
> >   * queued.
> >   */
> > -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > +			struct page *page)
> 
> Thusly does kerneldoc bitrot.
> 

Now, I am become Rot, Destroyer of Kerneldoc.

Kerneldoc comment moved to correct location.

> >  {
> >  	unsigned long flags;
> >  
> > @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> >  	if (!list_empty_careful(&wait->task_list)) {
> >  		spin_lock_irqsave(&q->lock, flags);
> >  		list_del_init(&wait->task_list);
> > +		if (page && !waitqueue_active(q))
> > +			ClearPageWaiters(page);
> 
> And again, the assumption that all users of this page use the same
> waitqueue avoids the races?
> 

Yes. If there are waitqueue collisions, there is no guarantee that the bit
will be cleared once there are no waiters left on that specific page;
detecting that accurately would require ref counts.

Will stick in a comment.
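
The dangling case in miniature, as a toy user-space model (nothing like
the real data structures, just the bookkeeping):

#include <stdbool.h>
#include <stdio.h>

#define NR_QUEUES 4	/* tiny stand-in for the hashed page wait table */

struct page { int id; bool waiters; };

static int queue_len[NR_QUEUES];

/* hashed lookup: distinct pages can land on the same queue (a collision) */
static int page_queue(const struct page *page)
{
	return page->id % NR_QUEUES;
}

static void prepare_to_wait(struct page *page)
{
	page->waiters = true;		/* SetPageWaiters() */
	queue_len[page_queue(page)]++;	/* __add_wait_queue() */
}

static void finish_wait(struct page *page)
{
	int q = page_queue(page);

	queue_len[q]--;
	/* the flag is only cleared once the *shared* queue drains */
	if (queue_len[q] == 0)
		page->waiters = false;	/* ClearPageWaiters() */
}

int main(void)
{
	struct page a = { .id = 0 }, b = { .id = 4 };	/* 0 % 4 == 4 % 4 */

	prepare_to_wait(&a);
	prepare_to_wait(&b);
	finish_wait(&a);	/* b is still queued, so a's bit stays set */
	printf("a.waiters = %d (dangling), b.waiters = %d\n",
	       a.waiters, b.waiters);
	return 0;
}

Hence the __ClearPageWaiters() calls in the page-freeing paths.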

> >  		spin_unlock_irqrestore(&q->lock, flags);
> >  	}
> >  }
> > +
> > +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > +	return __finish_wait(q, wait, NULL);
> > +}
> >  EXPORT_SYMBOL(finish_wait);
> >  
> >  /**
> > @@ -331,6 +347,22 @@ __wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
> >  	finish_wait(wq, &q->wait);
> >  	return ret;
> >  }
> > +
> > +int __sched
> > +__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > +			struct page *page,
> > +			int (*action)(void *), unsigned mode)
> 
> Comment over __wait_on_bit needs updating.
> 
> > +{
> > +	int ret = 0;
> > +
> > +	do {
> > +		__prepare_to_wait(wq, &q->wait, page, mode, false);
> > +		if (test_bit(q->key.bit_nr, q->key.flags))
> > +			ret = (*action)(q->key.flags);
> > +	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
> > +	__finish_wait(wq, &q->wait, page);
> > +	return ret;
> > +}
> 
> __wait_on_bit() can now become a wrapper which calls this with page==NULL?
> 

Yep. Early in development this would have looked ugly, but not in the
final version; I failed to fix it up.

> >  EXPORT_SYMBOL(__wait_on_bit);
> 
> This export is now misplaced.
> 
> >  int __sched out_of_line_wait_on_bit(void *word, int bit,
> > @@ -344,6 +376,27 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
> >  EXPORT_SYMBOL(out_of_line_wait_on_bit);
> >  
> >  int __sched
> > +__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > +			struct page *page,
> > +			int (*action)(void *), unsigned mode)
> > +{
> > +	do {
> > +		int ret;
> > +
> > +		__prepare_to_wait(wq, &q->wait, page, mode, true);
> > +		if (!test_bit(q->key.bit_nr, q->key.flags))
> > +			continue;
> > +		ret = action(q->key.flags);
> > +		if (!ret)
> > +			continue;
> > +		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
> > +		return ret;
> > +	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> > +	__finish_wait(wq, &q->wait, page);
> > +	return 0;
> > +}
> 
> You are in a maze of twisty little functions, all alike.  Perhaps some
> rudimentary documentation here?  Like what on earth does
> __wait_on_page_bit_lock() actually do?   And `mode'.
> 

Most of the useful commentary on what this does is in the kerneldoc comment
for wait_on_bit, including the meaning of "mode".

> 
> > +int __sched
> >  __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> >  			int (*action)(void *), unsigned mode)
> 
> Perhaps __wait_on_bit_lock() can become a wrapper around
> __wait_on_page_bit_lock().
> 

Yes.

> > 
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
> >  static void __put_single_page(struct page *page)
> >  {
> >  	__page_cache_release(page);
> > +
> > +	/* Clear dangling waiters from collisions on page_waitqueue */
> > +	__ClearPageWaiters(page);
> 
> What's this collisions thing?
> 

I'll expand the comment.

> >  	free_hot_cold_page(page, false);
> >  }
> >  
> > 
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1096,6 +1096,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  		 * waiting on the page lock, because there are no references.
> >  		 */
> >  		__clear_page_locked(page);
> > +		__ClearPageWaiters(page);
> 
> We're freeing the page - if someone is still waiting on it then we have
> a huge bug?  It's the mysterious collision thing again I hope?
> 

Yes. Freeing a page with active waiters would also mean attempting to free
a page with an elevated ref count. The page allocator should catch that
and scream.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 21:50                     ` Andrew Morton
@ 2014-05-22  0:07                       ` Mel Gorman
  2014-05-22  7:20                         ` Peter Zijlstra
  2014-05-22  6:45                       ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5 Peter Zijlstra
  1 sibling, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-22  0:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> > > > +static inline void
> > > > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > > > +			struct page *page, int state, bool exclusive)
> > > 
> > > Putting MM stuff into core waitqueue code is rather bad.  I really
> > > don't know how I'm going to explain this to my family.
> > 
> > Right, so we could avoid all that and make the functions in mm/filemap.c
> > rather large and opencode a bunch of wait.c stuff.
> > 
> 
> The world won't end if we do it Mel's way and it's probably the most
> efficient.  But ugh.  This stuff does raise the "it had better be a
> useful patch" bar.
> 
> > Which is pretty much what I initially pseudo proposed.
> 
> Alternative solution is not to merge the patch ;)
> 

While true, the overhead of the page_waitqueue lookups and unnecessary
wakeups sucks even on small machines. Not only does it hit us during simple
operations like dd to a file, but we would also hit it during page reclaim,
which is trylock_page/unlock_page intensive.

> > > > +		__ClearPageWaiters(page);
> > > 
> > > We're freeing the page - if someone is still waiting on it then we have
> > > a huge bug?  It's the mysterious collision thing again I hope?
> > 
> > Yeah, so we only clear that bit when at 'unlock' we find there are no
> > more pending waiters, so if the last unlock still had a waiter, we'll
> > leave the bit set.
> 
> Confused.  If the last unlock had a waiter, that waiter will get woken
> up so there are no waiters any more, so the last unlock clears the flag.
> 
> um, how do we determine that there are no more waiters?  By looking at
> the waitqueue.  But that waitqueue is hashed, so it may contain waiters
> for other pages so we're screwed?  But we could just go and wake up the
> other-page waiters anyway and still clear PG_waiters?
> 
> um2, we're using exclusive waitqueues so we can't (or don't) wake all
> waiters, so we're screwed again?
> 
> (This process is proving to be a hard way of writing Mel's changelog btw).
> 
> If I'm still on track here, what happens if we switch to wake-all so we
> can avoid the dangling flag?  I doubt if there are many collisions on
> that hash table?
> 
> If there *are* a lot of collisions, I bet it's because a great pile of
> threads are all waiting on the same page.  If they're trying to lock
> that page then wake-all is bad.  But if they're just waiting for IO
> completion (probable) then it's OK.
> 
> I'll stop now.

Rather than putting details in the changelog, here is an updated version
that hopefully improves the commentary to the point where it's actually
clear. 

---8<---
From: Nick Piggin <npiggin@suse.de>
Subject: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v6

Changelog since v5
o __always_inline where appropriate	(peterz)
o Documentation				(akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are processes waiting on PG_locked or PG_writeback
and uses it to avoid memory barriers and a waitqueue hash lookup in the
unlock_page fastpath.

This adds a few branches to the fast path but avoids bouncing a dirty
cache line between CPUs. 32-bit machines always take the slow path but the
primary motivation for this patch is large machines so I do not think that
is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of
the file is 1/10th physical memory to avoid dirty page balancing. In the
async case it is possible that the workload completes without even
hitting the disk and will have variable results, but it highlights the impact
of mark_page_accessed for async IO. The sync results are expected to be
more stable. The exception is tmpfs where the normal case is for the "IO"
to not hit the disk.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. Throughput and wall times are presented for sync IO; only
wall times are shown for async as the granularity reported by dd and the
variability make it unsuitable for comparison. As the async results were
variable due to writeback timings, I'm only reporting the maximum figures.
The sync results were stable enough to make the mean and stddev uninteresting.

The performance results are reported based on a run with no profiling.
Profile data is based on a separate run with oprofile running. The
kernels being compared are "accessed-v2", which is the patch series up
to this patch, whereas lockpage-v2 includes this patch.

async dd
                                 3.15.0-rc5            3.15.0-rc5
                                      mmotm           lockpage-v5
btrfs Max      ddtime      0.5863 (  0.00%)      0.5621 (  4.14%)
ext3  Max      ddtime      1.4870 (  0.00%)      1.4609 (  1.76%)
ext4  Max      ddtime      1.0440 (  0.00%)      1.0376 (  0.61%)
tmpfs Max      ddtime      0.3541 (  0.00%)      0.3486 (  1.54%)
xfs   Max      ddtime      0.4995 (  0.00%)      0.4834 (  3.21%)

A separate run with profiles showed this

     samples percentage
ext3  225851    2.3180  vmlinux-3.15.0-rc5-mmotm       test_clear_page_writeback
ext3  106848    1.0966  vmlinux-3.15.0-rc5-mmotm       __wake_up_bit
ext3   71849    0.7374  vmlinux-3.15.0-rc5-mmotm       page_waitqueue
ext3   40319    0.4138  vmlinux-3.15.0-rc5-mmotm       unlock_page
ext3   26243    0.2693  vmlinux-3.15.0-rc5-mmotm       end_page_writeback
ext3  178777    1.7774  vmlinux-3.15.0-rc5-lockpage-v5 test_clear_page_writeback
ext3   67702    0.6731  vmlinux-3.15.0-rc5-lockpage-v5 unlock_page
ext3   22357    0.2223  vmlinux-3.15.0-rc5-lockpage-v5 end_page_writeback
ext3   11131    0.1107  vmlinux-3.15.0-rc5-lockpage-v5 __wake_up_bit
ext3    6360    0.0632  vmlinux-3.15.0-rc5-lockpage-v5 __wake_up_page_bit
ext3    1660    0.0165  vmlinux-3.15.0-rc5-lockpage-v5 page_waitqueue

The profiles show a clear reduction in the waitqueue and wakeup functions.
Note that end_page_writeback itself costs about the same; the savings there
come from reduced calls to __wake_up_bit and page_waitqueue, so there is no
obvious direct saving in that function. The cost of unlock_page is higher
as it now checks PageWaiters, but that is offset by the reduced number of
calls to page_waitqueue and __wake_up_bit. A similar story is told for each
of the filesystems. Note that for workloads that contend heavily on the
page lock, unlock_page may increase in cost as it has to clear PG_waiters,
so while the typical case should be much faster, the worst-case costs are
now higher.

This is also reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                       3.15.0-rc5            3.15.0-rc5
                            mmotm           lockpage-v5
Procs 107M     423.0000 (  0.00%)    409.0000 (  3.31%)
Procs 214M     847.0000 (  0.00%)    823.0000 (  2.83%)
Procs 322M    1296.0000 (  0.00%)   1232.0000 (  4.94%)
Procs 429M    1692.0000 (  0.00%)   1644.0000 (  2.84%)
Procs 536M    2137.0000 (  0.00%)   2057.0000 (  3.74%)
Procs 644M    2542.0000 (  0.00%)   2472.0000 (  2.75%)
Procs 751M    2953.0000 (  0.00%)   2872.0000 (  2.74%)
Procs 859M    3360.0000 (  0.00%)   3310.0000 (  1.49%)
Procs 966M    3770.0000 (  0.00%)   3724.0000 (  1.22%)
Procs 1073M   4220.0000 (  0.00%)   4114.0000 (  2.51%)
Procs 1181M   4638.0000 (  0.00%)   4546.0000 (  1.98%)
Procs 1288M   5038.0000 (  0.00%)   4940.0000 (  1.95%)
Procs 1395M   5481.0000 (  0.00%)   5431.0000 (  0.91%)
Procs 1503M   5940.0000 (  0.00%)   5832.0000 (  1.82%)
Procs 1610M   6316.0000 (  0.00%)   6204.0000 (  1.77%)
Procs 1717M   6749.0000 (  0.00%)   6799.0000 ( -0.74%)
Procs 1825M   7323.0000 (  0.00%)   7082.0000 (  3.29%)
Procs 1932M   7694.0000 (  0.00%)   7452.0000 (  3.15%)
Procs 2040M   8079.0000 (  0.00%)   7927.0000 (  1.88%)
Procs 2147M   8495.0000 (  0.00%)   8360.0000 (  1.59%)

   samples percentage
xfs  78334    1.3089  vmlinux-3.15.0-rc5-mmotm          page_waitqueue
xfs  55910    0.9342  vmlinux-3.15.0-rc5-mmotm          unlock_page
xfs  45120    0.7539  vmlinux-3.15.0-rc5-mmotm          __wake_up_bit
xfs  41414    0.6920  vmlinux-3.15.0-rc5-mmotm          test_clear_page_writeback
xfs   4823    0.0806  vmlinux-3.15.0-rc5-mmotm          end_page_writeback
xfs 100864    1.7063  vmlinux-3.15.0-rc5-lockpage-v5    unlock_page
xfs  52547    0.8889  vmlinux-3.15.0-rc5-lockpage-v5    test_clear_page_writeback
xfs   5031    0.0851  vmlinux-3.15.0-rc5-lockpage-v5    end_page_writeback
xfs   1938    0.0328  vmlinux-3.15.0-rc5-lockpage-v5    __wake_up_bit
xfs      9   1.5e-04  vmlinux-3.15.0-rc5-lockpage-v5    __wake_up_page_bit
xfs      7   1.2e-04  vmlinux-3.15.0-rc5-lockpage-v5    page_waitqueue

[jack@suse.cz: Fix add_page_wait_queue]
[mhocko@suse.cz: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@sgi.com: Do not update struct page unnecessarily]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h |  18 ++++++
 include/linux/wait.h       |   8 +++
 kernel/sched/wait.c        | 137 ++++++++++++++++++++++++++++++++++-----------
 mm/filemap.c               |  25 +++++----
 mm/page_alloc.c            |   1 +
 mm/swap.c                  |  12 ++++
 mm/vmscan.c                |   7 +++
 7 files changed, 165 insertions(+), 43 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7baf0fe..b697e4f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+	TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -509,6 +526,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..9226724 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -141,14 +141,21 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 	list_del(&old->task_list);
 }
 
+struct page;
+
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
 int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 void wake_up_bit(void *, int);
 void wake_up_atomic_t(atomic_t *);
 int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
@@ -822,6 +829,7 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void finish_wait_page(wait_queue_head_t *q, wait_queue_t *wait, struct page *page);
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..bd0495a92 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -167,31 +167,47 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * stops them from bleeding out - it would still allow subsequent
  * loads to move into the critical region).
  */
-void
-prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+static __always_inline void
+__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
+			struct page *page, int state, bool exclusive)
 {
 	unsigned long flags;
 
-	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue(q, wait);
+
+	/*
+	 * pages are hashed on a waitqueue that is expensive to lookup.
+	 * __wait_on_page_bit and __wait_on_page_bit_lock pass in a page
+	 * to set PG_waiters here. A PageWaiters() can then be used at
+	 * unlock time or when writeback completes to detect if there
+	 * are any potential waiters that justify a lookup.
+	 */
+	if (page && !PageWaiters(page))
+		SetPageWaiters(page);
+	if (list_empty(&wait->task_list)) {
+		if (exclusive) {
+			wait->flags |= WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue_tail(q, wait);
+		} else {
+			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue(q, wait);
+		}
+	}
 	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
+
+void
+prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+{
+	return __prepare_to_wait(q, wait, NULL, state, false);
+}
 EXPORT_SYMBOL(prepare_to_wait);
 
 void
 prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
-	unsigned long flags;
-
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue_tail(q, wait);
-	set_current_state(state);
-	spin_unlock_irqrestore(&q->lock, flags);
+	return __prepare_to_wait(q, wait, NULL, state, true);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
@@ -219,16 +235,8 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 }
 EXPORT_SYMBOL(prepare_to_wait_event);
 
-/**
- * finish_wait - clean up after waiting in a queue
- * @q: waitqueue waited on
- * @wait: wait descriptor
- *
- * Sets current thread back to running state and removes
- * the wait descriptor from the given waitqueue if still
- * queued.
- */
-void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+static __always_inline void __finish_wait(wait_queue_head_t *q,
+			wait_queue_t *wait, struct page *page)
 {
 	unsigned long flags;
 
@@ -249,9 +257,33 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 	if (!list_empty_careful(&wait->task_list)) {
 		spin_lock_irqsave(&q->lock, flags);
 		list_del_init(&wait->task_list);
+
+		/*
+		 * Clear PG_waiters if the waitqueue is no longer active. There
+		 * is no guarantee that a page with no waiters will get cleared
+		 * as there may be unrelated pages hashed to sleep on the same
+		 * queue. Accurate detection would require a counter but
+		 * collisions are expected to be rare.
+		 */
+		if (page && !waitqueue_active(q))
+			ClearPageWaiters(page);
 		spin_unlock_irqrestore(&q->lock, flags);
 	}
 }
+
+/**
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
+void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	return __finish_wait(q, wait, NULL);
+}
 EXPORT_SYMBOL(finish_wait);
 
 /**
@@ -313,24 +345,39 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 EXPORT_SYMBOL(wake_bit_function);
 
 /*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking)
- * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
- * permitted return codes. Nonzero return codes halt waiting and return.
+ * Waits on a bit to be cleared (see wait_on_bit in wait.h for details).
+ * A page is optionally provided when used to wait on the PG_locked or
+ * PG_writeback bit. By setting PG_waiters a lookup of the waitqueue
+ * can be avoided during unlock_page or end_page_writeback.
  */
 int __sched
-__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	int ret = 0;
 
 	do {
-		prepare_to_wait(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, false);
 		if (test_bit(q->key.bit_nr, q->key.flags))
 			ret = (*action)(q->key.flags);
 	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return ret;
 }
+
+/*
+ * To allow interruptible waiting and asynchronous (i.e. nonblocking)
+ * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
+ * permitted return codes. Nonzero return codes halt waiting and return.
+ */
+int __sched
+__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit(wq, q, NULL, action, mode);
+}
+
 EXPORT_SYMBOL(__wait_on_bit);
 
 int __sched out_of_line_wait_on_bit(void *word, int bit,
@@ -344,13 +391,14 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
 EXPORT_SYMBOL(out_of_line_wait_on_bit);
 
 int __sched
-__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	do {
 		int ret;
 
-		prepare_to_wait_exclusive(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, true);
 		if (!test_bit(q->key.bit_nr, q->key.flags))
 			continue;
 		ret = action(q->key.flags);
@@ -359,9 +407,16 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
 		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return 0;
 }
+
+int __sched
+__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit_lock(wq, q, NULL, action, mode);
+}
 EXPORT_SYMBOL(__wait_on_bit_lock);
 
 int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
@@ -374,6 +429,24 @@ int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
 }
 EXPORT_SYMBOL(out_of_line_wait_on_bit_lock);
 
+void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
+{
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+	unsigned long flags;
+
+	/*
+	 * Unlike __wake_up_bit, waitqueue_active must be checked under
+	 * wqh->lock to avoid races with parallel additions to the
+	 * waitqueue. Otherwise races could result in lost wakeups.
+	 */
+	spin_lock_irqsave(&wqh->lock, flags);
+	if (waitqueue_active(wqh))
+		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
+	else
+		ClearPageWaiters(page);
+	spin_unlock_irqrestore(&wqh->lock, flags);
+}
+
 void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
 {
 	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
diff --git a/mm/filemap.c b/mm/filemap.c
index 263cffe..07633a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -682,9 +682,9 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-static inline void wake_up_page(struct page *page, int bit)
+static inline void wake_up_page(struct page *page, int bit_nr)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	__wake_up_page_bit(page_waitqueue(page), page, &page->flags, bit_nr);
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
@@ -692,8 +692,8 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
 	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+		__wait_on_page_bit(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
@@ -704,7 +704,7 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
+	return __wait_on_page_bit(page_waitqueue(page), &wait, page,
 			     sleep_on_page_killable, TASK_KILLABLE);
 }
 
@@ -743,7 +743,8 @@ void unlock_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -769,7 +770,8 @@ void end_page_writeback(struct page *page)
 		BUG();
 
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -806,8 +808,8 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	__wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -815,9 +817,10 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	return __wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_KILLABLE);
 }
+
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1f005..ebb947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6603,6 +6603,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..1581dbf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+
+	/* See release_pages on why this clear may be necessary */
+	__ClearPageWaiters(page);
+
 	free_hot_cold_page(page, false);
 }
 
@@ -916,6 +920,14 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
 
+		/*
+		 * pages are hashed on a waitqueue so there may be collisions.
+		 * When waiters are woken the waitqueue is checked but
+		 * unrelated pages on the queue can leave the bit set. Clear
+		 * it here if that happens.
+		 */
+		__ClearPageWaiters(page);
+
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f85041..d7a4969 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,6 +1096,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * waiting on the page lock, because there are no references.
 		 */
 		__clear_page_locked(page);
+
+		/* See release_pages on why this clear may be necessary */
+		__ClearPageWaiters(page);
 free_it:
 		nr_reclaimed++;
 
@@ -1427,6 +1430,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
@@ -1650,6 +1655,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix
  2014-05-21 22:11         ` Andrew Morton
@ 2014-05-22  0:07           ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-22  0:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Vlastimil Babka, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Prabhakar Lad

On Wed, May 21, 2014 at 03:11:42PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 13:09:16 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > > From: Andrew Morton <akpm@linux-foundation.org>
> > > Subject: mm/shmem.c: don't run init_page_accessed() against an uninitialised pointer
> > > 
> > > If shmem_getpage() returned an error then it didn't necessarily initialise
> > > *pagep.  So shmem_write_begin() shouldn't be playing with *pagep in this
> > > situation.
> > > 
> > > Fixes an oops when "mm: non-atomically mark page accessed during page
> > > cache allocation where possible" (quite reasonably) left *pagep
> > > uninitialized.
> > > 
> > > Reported-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Jan Kara <jack@suse.cz>
> > > Cc: Michal Hocko <mhocko@suse.cz>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Cc: Dave Hansen <dave.hansen@intel.com>
> > > Cc: Mel Gorman <mgorman@suse.de>
> > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > 
> > Acked-by: Mel Gorman <mgorman@suse.de>
> 
> What to do with
> http://ozlabs.org/~akpm/mmots/broken-out/mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possible-fix.patch?
> 
> We shouldn't need it any more.  otoh it's pretty harmless.  otooh it
> will hide bugs such as this one.
> 

Drop it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix
  2014-05-20 19:34     ` Andrew Morton
  2014-05-21 12:09       ` Mel Gorman
@ 2014-05-22  5:35       ` Prabhakar Lad
  1 sibling, 0 replies; 103+ messages in thread
From: Prabhakar Lad @ 2014-05-22  5:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Vlastimil Babka, Jan Kara,
	Michal Hocko, Hugh Dickins, Peter Zijlstra, Dave Hansen,
	Linux Kernel, Linux-MM, Linux-FSDevel

On Wed, May 21, 2014 at 1:04 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 20 May 2014 16:49:00 +0100 Mel Gorman <mgorman@suse.de> wrote:
>
>> Prabhakar Lad reported the following problem
>>
>>   I see following issue on DA850 evm,
>>   git bisect points me to
>>   commit id: 975c3a671f11279441006a29a19f55ccc15fb320
>>   ( mm: non-atomically mark page accessed during page cache allocation
>>   where possible)
>>
>>   Unable to handle kernel paging request at virtual address 30e03501
>>   pgd = c68cc000
>>   [30e03501] *pgd=00000000
>>   Internal error: Oops: 1 [#1] PREEMPT ARM
>>   Modules linked in:
>>   CPU: 0 PID: 1015 Comm: network.sh Not tainted 3.15.0-rc5-00323-g975c3a6 #9
>>   task: c70c4e00 ti: c73d0000 task.ti: c73d0000
>>   PC is at init_page_accessed+0xc/0x24
>>   LR is at shmem_write_begin+0x54/0x60
>>   pc : [<c0088aa0>]    lr : [<c00923e8>]    psr: 20000013
>>   sp : c73d1d90  ip : c73d1da0  fp : c73d1d9c
>>   r10: c73d1dec  r9 : 00000000  r8 : 00000000
>>   r7 : c73d1e6c  r6 : c694d7bc  r5 : ffffffe4  r4 : c73d1dec
>>   r3 : c73d0000  r2 : 00000001  r1 : 00000000  r0 : 30e03501
>>   Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
>>   Control: 0005317f  Table: c68cc000  DAC: 00000015
>>   Process network.sh (pid: 1015, stack limit = 0xc73d01c0)
>>
>> pagep is set but not pointing to anywhere valid as it's an uninitialised
>> stack variable. This patch is a fix to
>> mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possible.patch
>>
>> ...
>>
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -2459,7 +2459,7 @@ ssize_t generic_perform_write(struct file *file,
>>               flags |= AOP_FLAG_UNINTERRUPTIBLE;
>>
>>       do {
>> -             struct page *page;
>> +             struct page *page = NULL;
>>               unsigned long offset;   /* Offset into pagecache page */
>>               unsigned long bytes;    /* Bytes to write to page */
>>               size_t copied;          /* Bytes copied from user */
>
> Well not really.  generic_perform_write() only touches *page if
> ->write_begin() returned "success", which is reasonable behavior.
>
> I'd say you mucked up shmem_write_begin() - it runs
> init_page_accessed() even if shmem_getpage() returned an error.  It
> shouldn't be doing that.
>
> This?
>
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: mm/shmem.c: don't run init_page_accessed() against an uninitialised pointer
>
> If shmem_getpage() returned an error then it didn't necessarily initialise
> *pagep.  So shmem_write_begin() shouldn't be playing with *pagep in this
> situation.
>
> Fixes an oops when "mm: non-atomically mark page accessed during page
> cache allocation where possible" (quite reasonably) left *pagep
> uninitialized.
>
> Reported-by: Prabhakar Lad <prabhakar.csengg@gmail.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  mm/shmem.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff -puN mm/shmem.c~mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possiblefix-2 mm/shmem.c
> --- a/mm/shmem.c~mm-non-atomically-mark-page-accessed-during-page-cache-allocation-where-possiblefix-2
> +++ a/mm/shmem.c
> @@ -1376,7 +1376,7 @@ shmem_write_begin(struct file *file, str
>         struct inode *inode = mapping->host;
>         pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>         ret = shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
> -       if (*pagep)
> +       if (ret == 0 && *pagep)
>                 init_page_accessed(*pagep);
>         return ret;
>  }

Reported-and-Tested-by: Lad, Prabhakar <prabhakar.csengg@gmail.com>

Regards,
--Prabhakar Lad

> _
>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
  2014-05-21 21:50                     ` Andrew Morton
  2014-05-22  0:07                       ` Mel Gorman
@ 2014-05-22  6:45                       ` Peter Zijlstra
  2014-05-22  8:46                         ` Mel Gorman
  1 sibling, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-22  6:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

[-- Attachment #1: Type: text/plain, Size: 2059 bytes --]

On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra <peterz@infradead.org> wrote:

> Alternative solution is not to merge the patch ;)

There is always that.. :-)

> > Yeah, so we only clear that bit when at 'unlock' we find there are no
> > more pending waiters, so if the last unlock still had a waiter, we'll
> > leave the bit set.
> 
> Confused.  If the last unlock had a waiter, that waiter will get woken
> up so there are no waiters any more, so the last unlock clears the flag.
> 
> um, how do we determine that there are no more waiters?  By looking at
> the waitqueue.  But that waitqueue is hashed, so it may contain waiters
> for other pages so we're screwed?  But we could just go and wake up the
> other-page waiters anyway and still clear PG_waiters?
> 
> um2, we're using exclusive waitqueues so we can't (or don't) wake all
> waiters, so we're screwed again?

Ah, so leave it set. Then when we do an uncontended wakeup, that is a
wakeup where there are _no_ waiters left, we'll iterate the entire
hashed queue, looking for a matching page.

We'll find none, and only then clear the bit.


> (This process is proving to be a hard way of writing Mel's changelog btw).

Agreed :/

> If I'm still on track here, what happens if we switch to wake-all so we
> can avoid the dangling flag?  I doubt if there are many collisions on
> that hash table?

Wake-all will be ugly and loose a herd of waiters, all racing to
acquire, all but one of whom will lose the race. It also loses the
fairness; it's currently a FIFO queue. Wake-all will allow starvation.

> If there *are* a lot of collisions, I bet it's because a great pile of
> threads are all waiting on the same page.  If they're trying to lock
> that page then wake-all is bad.  But if they're just waiting for IO
> completion (probable) then it's OK.

Yeah, I'm not entirely sure on the rationale for adding PG_waiters to
writeback completion, and yes PG_writeback is a wake-all.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
  2014-05-22  0:07                       ` Mel Gorman
@ 2014-05-22  7:20                         ` Peter Zijlstra
  2014-05-22 10:40                           ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7 Mel Gorman
  0 siblings, 1 reply; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-22  7:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

[-- Attachment #1: Type: text/plain, Size: 1447 bytes --]

On Thu, May 22, 2014 at 01:07:15AM +0100, Mel Gorman wrote:

> +PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
> +	TESTCLEARFLAG(Waiters, waiters)
> +#define __PG_WAITERS		(1 << PG_waiters)
> +#else
> +/* Always fallback to slow path on 32-bit */
> +static inline bool PageWaiters(struct page *page)
> +{
> +	return true;
> +}
> +static inline void __ClearPageWaiters(struct page *page) {}
> +static inline void ClearPageWaiters(struct page *page) {}
> +static inline void SetPageWaiters(struct page *page) {}
> +#define __PG_WAITERS		0


> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> +{
> +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> +	unsigned long flags;
> +
> +	/*
> +	 * Unlike __wake_up_bit, waitqueue_active must be checked under
> +	 * wqh->lock to avoid races with parallel additions to the
> +	 * waitqueue. Otherwise races could result in lost wakeups.
> +	 */

Well, you could do something like:

	if (!__PG_WAITERS && !waitqueue_active(wqh))
		return;

Which at least for 32bit restores some of the performance loss of this
patch (did you have 32bit numbers in that massive changelog?, I totally
tl;dr it).

> +	spin_lock_irqsave(&wqh->lock, flags);
> +	if (waitqueue_active(wqh))
> +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> +	else
> +		ClearPageWaiters(page);
> +	spin_unlock_irqrestore(&wqh->lock, flags);
> +}
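
Roughly the quoted function with that check folded in as an early exit, i.e.
a sketch against the v6 code rather than a tested patch:

void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
{
	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
	unsigned long flags;

	/* Without PG_waiters (32-bit), fall back to the old unlocked check */
	if (!__PG_WAITERS && !waitqueue_active(wqh))
		return;

	spin_lock_irqsave(&wqh->lock, flags);
	if (waitqueue_active(wqh))
		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
	else
		ClearPageWaiters(page);
	spin_unlock_irqrestore(&wqh->lock, flags);
}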

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
  2014-05-22  6:45                       ` [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5 Peter Zijlstra
@ 2014-05-22  8:46                         ` Mel Gorman
  2014-05-22 17:47                           ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-22  8:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Thu, May 22, 2014 at 08:45:29AM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 02:50:00PM -0700, Andrew Morton wrote:
> > On Wed, 21 May 2014 23:33:54 +0200 Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Alternative solution is not to merge the patch ;)
> 
> There is always that.. :-)
> 
> > > Yeah, so we only clear that bit when at 'unlock' we find there are no
> > > more pending waiters, so if the last unlock still had a waiter, we'll
> > > leave the bit set.
> > 
> > Confused.  If the last unlock had a waiter, that waiter will get woken
> > up so there are no waiters any more, so the last unlock clears the flag.
> > 
> > um, how do we determine that there are no more waiters?  By looking at
> > the waitqueue.  But that waitqueue is hashed, so it may contain waiters
> > for other pages so we're screwed?  But we could just go and wake up the
> > other-page waiters anyway and still clear PG_waiters?
> > 
> > um2, we're using exclusive waitqueues so we can't (or don't) wake all
> > waiters, so we're screwed again?
> 
> Ah, so leave it set. Then when we do an uncontended wakeup, that is a
> wakeup where there are _no_ waiters left, we'll iterate the entire
> hashed queue, looking for a matching page.
> 
> We'll find none, and only then clear the bit.
> 

Yes, sorry that was not clear.

> 
> > (This process is proving to be a hard way of writing Mel's changelog btw).
> 
> Agreed :/
> 

I've lost sight of what is obvious and what is not. The introduction
now reads

	This patch introduces a new page flag for 64-bit capable machines,
	PG_waiters, to signal there are *potentially* processes waiting on
	PG_locked or PG_writeback.  If there are no possible waiters then we
	avoid barriers, a waitqueue hash lookup and a failed wake_up in the
	unlock_page and end_page_writeback paths. There is no guarantee
	that waiters exist if PG_waiters is set as multiple pages can
	hash to the same waitqueue and we cannot accurately detect if a
	waking process is the last waiter without a reference count. When
	this happens, the bit is left set and the next unlock or writeback
	completion will lookup the waitqueue and clear the bit when there
	are no collisions. This adds a few branches to the fast path but
	avoids bouncing a dirty cache line between CPUs. 32-bit machines
	always take the slow path but the primary motivation for this
	patch is large machines so I do not think that is a concern.
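
Condensed from the mm/filemap.c hunk of the patch, the unlock fastpath
described above ends up as:

	void unlock_page(struct page *page)
	{
		VM_BUG_ON_PAGE(!PageLocked(page), page);
		clear_bit_unlock(PG_locked, &page->flags);
		smp_mb__after_atomic();
		/* Hashed waitqueue lookup only if a waiter may exist */
		if (unlikely(PageWaiters(page)))
			wake_up_page(page, PG_locked);
	}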

> > If I'm still on track here, what happens if we switch to wake-all so we
> > can avoid the dangling flag?  I doubt if there are many collisions on
> > that hash table?
> 
> Wake-all will be ugly and loose a herd of waiters, all racing to
> acquire, all but one of whom will lose the race. It also loses the
> fairness; it's currently a FIFO queue. Wake-all will allow starvation.
> 

And the cost of the thundering herd of waiters may offset any benefit of
reducing the number of calls to page_waitqueue and waker functions.

> > If there *are* a lot of collisions, I bet it's because a great pile of
> > threads are all waiting on the same page.  If they're trying to lock
> > that page then wake-all is bad.  But if they're just waiting for IO
> > completion (probable) then it's OK.
> 
> Yeah, I'm not entirely sure on the rationale for adding PG_waiters to
> writeback completion, and yes PG_writeback is a wake-all.

tmpfs was the most obvious one. We were doing a useless lookup almost
every time writeback completed for async streaming writers. I suspected
it would also apply to normal filesystems if backed by fast storage.

There is not much to gain by continuing to use __wake_up_bit in the
writeback paths when PG_waiters is available. Only the first waiter
incurs the SetPageWaiters penalty. In the uncontended case, neither
approach takes locks (one checks waitqueue_active outside the lock, the
other checks PageWaiters). When there are waiters, both approaches end up
taking q->lock, either in __wake_up_bit->__wake_up or in
__wake_up_page_bit.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 09/19] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-13  9:45 ` [PATCH 09/19] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps Mel Gorman
@ 2014-05-22  9:24   ` Vlastimil Babka
  2014-05-22 18:23     ` Andrew Morton
  0 siblings, 1 reply; 103+ messages in thread
From: Vlastimil Babka @ 2014-05-22  9:24 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton, Joonsoo Kim
  Cc: Johannes Weiner, Jan Kara, Michal Hocko, Hugh Dickins,
	Peter Zijlstra, Dave Hansen, Linux Kernel, Linux-MM,
	Linux-FSDevel

On 05/13/2014 11:45 AM, Mel Gorman wrote:
> The test_bit operations in get/set pageblock flags are expensive. This patch
> reads the bitmap on a word basis and use shifts and masks to isolate the bits
> of interest. Similarly masks are used to set a local copy of the bitmap and then
> use cmpxchg to update the bitmap if there have been no other changes made in
> parallel.
> 
> In a test running dd onto tmpfs the overhead of the pageblock-related
> functions went from 1.27% in profiles to 0.5%.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Hi, I've tested whether this closes the race I had previously been trying to
fix with the series in http://marc.info/?l=linux-mm&m=139359694028925&w=2
and indeed with this patch I was no longer able to reproduce it in my stress
test (which adds lots of memory isolation calls). So thanks to Mel I can
dump my series in the trashcan :P
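
For anyone reading along, a minimal userspace sketch of the word-based read
and cmpxchg-style update described in the changelog above. The names and the
4-bits-per-block layout are assumptions for the example, not the kernel
implementation; it also assumes the per-block bits evenly divide the word
size so a block never straddles two words:

#include <stdatomic.h>

#define BLOCK_BITS	4UL	/* flag bits stored per pageblock (assumed) */
#define BLOCK_MASK	((1UL << BLOCK_BITS) - 1)
#define WORD_BITS	(8 * sizeof(unsigned long))

static unsigned long block_flags_get(_Atomic unsigned long *bitmap,
				     unsigned long blockno)
{
	unsigned long shift = (blockno * BLOCK_BITS) % WORD_BITS;
	unsigned long word = atomic_load(&bitmap[(blockno * BLOCK_BITS) / WORD_BITS]);

	/* One word read, then shifts and masks isolate the bits of interest */
	return (word >> shift) & BLOCK_MASK;
}

static void block_flags_set(_Atomic unsigned long *bitmap,
			    unsigned long blockno, unsigned long flags)
{
	_Atomic unsigned long *word = &bitmap[(blockno * BLOCK_BITS) / WORD_BITS];
	unsigned long shift = (blockno * BLOCK_BITS) % WORD_BITS;
	unsigned long old_word = atomic_load(word), new_word;

	do {
		/* Build the updated word locally, then publish it atomically
		 * so a racing update to a neighbouring block is never lost. */
		new_word = (old_word & ~(BLOCK_MASK << shift)) |
			   ((flags & BLOCK_MASK) << shift);
	} while (!atomic_compare_exchange_weak(word, &old_word, new_word));
}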

Therefore I believe something like below should be added to the changelog,
and put to stable as well.

Thanks,
Vlastimil

-----8<-----
In addition to the performance benefits, this patch closes races that are
possible between:

a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
   reads part of the bits before and other part of the bits after
   set_pageblock_migratetype() has updated them.

b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
   read-modify-write set bit operation in set_pageblock_skip() will cause
   lost updates to some of the bits changed by set_pageblock_migratetype().

Joonsoo Kim first reported case a) via code inspection. Vlastimil Babka's
testing with a debug patch showed that either a) or b) occurs roughly once per
mmtests' stress-highalloc benchmark (although not necessarily in the same
pageblock). Furthermore, during development of unrelated compaction patches
with frequent calls to {start,undo}_isolate_page_range(), the race was observed
to occur several thousand times and to result in NULL pointer dereferences in
move_freepages() and free_one_page() in places where free_list[migratetype] is
manipulated by e.g. list_move(). Further debugging confirmed that migratetype
had an invalid value of 6, causing out-of-bounds access to the free_list array.

That confirmed that the race exists, although it may be extremely rare, and is
currently only fatal where page isolation is performed due to memory hot-remove.
Races on pageblocks being updated by set_pageblock_migratetype(), where both the
old and new migratetype are lower than MIGRATE_RESERVE, currently cannot result
in an invalid value being observed, although theoretically they may still lead
to unexpected creation or destruction of MIGRATE_RESERVE pageblocks. Furthermore,
things could get suddenly worse when memory isolation is used more, or when new
migratetypes are added.

After this patch, the race is no longer observed in testing.

Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7
  2014-05-22  7:20                         ` Peter Zijlstra
@ 2014-05-22 10:40                           ` Mel Gorman
  2014-05-22 10:56                             ` Peter Zijlstra
  0 siblings, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-22 10:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Paul McKenney,
	Linus Torvalds, David Howells, Linux Kernel, Linux-MM,
	Linux-FSDevel

Changelog since v6
o Optimisation when PG_waiters is not available	(peterz)
o Documentation

Changelog since v5
o __always_inline where appropriate		(peterz)
o Documentation					(akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

From: Nick Piggin <npiggin@suse.de>

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are *potentially* processes waiting on
PG_locked or PG_writeback.  If there are no possible waiters then we avoid
barriers, a waitqueue hash lookup and a failed wake_up in the unlock_page
and end_page_writeback paths. There is no guarantee that waiters exist if
PG_waiters is set as multiple pages can hash to the same waitqueue and we
cannot accurately detect if a waking process is the last waiter without
a reference count. When this happens, the bit is left set and a future
unlock or writeback completion will lookup the waitqueue and clear the
bit when there are no collisions. This adds a few branches to the fast
path but avoids bouncing a dirty cache line between CPUs. 32-bit machines
always take the slow path but the primary motivation for this patch is
large machines so I do not think that is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th physical memory to avoid dirty page balancing. After each
dd there is a sync so the reported times do not vary much. Measuring the
time the async dd itself takes highlights the impact of page_waitqueue
overhead for async IO.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. The performance results are reported based on a run with
no profiling.  Profile data is based on a separate run with oprofile running.

async dd
                                 3.15.0-rc5            3.15.0-rc5
                                      mmotm           lockpage
btrfs Max      ddtime      0.5863 (  0.00%)      0.5621 (  4.14%)
ext3  Max      ddtime      1.4870 (  0.00%)      1.4609 (  1.76%)
ext4  Max      ddtime      1.0440 (  0.00%)      1.0376 (  0.61%)
tmpfs Max      ddtime      0.3541 (  0.00%)      0.3486 (  1.54%)
xfs   Max      ddtime      0.4995 (  0.00%)      0.4834 (  3.21%)

A separate run with profiles showed this

     samples percentage
ext3  225851    2.3180  vmlinux-3.15.0-rc5-mmotm       test_clear_page_writeback
ext3  106848    1.0966  vmlinux-3.15.0-rc5-mmotm       __wake_up_bit
ext3   71849    0.7374  vmlinux-3.15.0-rc5-mmotm       page_waitqueue
ext3   40319    0.4138  vmlinux-3.15.0-rc5-mmotm       unlock_page
ext3   26243    0.2693  vmlinux-3.15.0-rc5-mmotm       end_page_writeback
ext3  178777    1.7774  vmlinux-3.15.0-rc5-lockpage test_clear_page_writeback
ext3   67702    0.6731  vmlinux-3.15.0-rc5-lockpage unlock_page
ext3   22357    0.2223  vmlinux-3.15.0-rc5-lockpage end_page_writeback
ext3   11131    0.1107  vmlinux-3.15.0-rc5-lockpage __wake_up_bit
ext3    6360    0.0632  vmlinux-3.15.0-rc5-lockpage __wake_up_page_bit
ext3    1660    0.0165  vmlinux-3.15.0-rc5-lockpage page_waitqueue

The profiles show a clear reduction in the waitqueue and wakeup functions.
Note that end_page_writeback itself costs about the same; the savings there
come from reduced calls to __wake_up_bit and page_waitqueue, so there is no
obvious direct saving in that function. The cost of unlock_page is higher
as it now checks PageWaiters, but that is offset by the reduced number of
calls to page_waitqueue and __wake_up_bit. A similar story is told for each
of the filesystems. Note that for workloads that contend heavily on the
page lock, unlock_page may increase in cost as it has to clear PG_waiters,
so while the typical case should be much faster, the worst-case costs are
now higher.

This is also reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                       3.15.0-rc5            3.15.0-rc5
                            mmotm           lockpage
Procs 107M     423.0000 (  0.00%)    409.0000 (  3.31%)
Procs 214M     847.0000 (  0.00%)    823.0000 (  2.83%)
Procs 322M    1296.0000 (  0.00%)   1232.0000 (  4.94%)
Procs 429M    1692.0000 (  0.00%)   1644.0000 (  2.84%)
Procs 536M    2137.0000 (  0.00%)   2057.0000 (  3.74%)
Procs 644M    2542.0000 (  0.00%)   2472.0000 (  2.75%)
Procs 751M    2953.0000 (  0.00%)   2872.0000 (  2.74%)
Procs 859M    3360.0000 (  0.00%)   3310.0000 (  1.49%)
Procs 966M    3770.0000 (  0.00%)   3724.0000 (  1.22%)
Procs 1073M   4220.0000 (  0.00%)   4114.0000 (  2.51%)
Procs 1181M   4638.0000 (  0.00%)   4546.0000 (  1.98%)
Procs 1288M   5038.0000 (  0.00%)   4940.0000 (  1.95%)
Procs 1395M   5481.0000 (  0.00%)   5431.0000 (  0.91%)
Procs 1503M   5940.0000 (  0.00%)   5832.0000 (  1.82%)
Procs 1610M   6316.0000 (  0.00%)   6204.0000 (  1.77%)
Procs 1717M   6749.0000 (  0.00%)   6799.0000 ( -0.74%)
Procs 1825M   7323.0000 (  0.00%)   7082.0000 (  3.29%)
Procs 1932M   7694.0000 (  0.00%)   7452.0000 (  3.15%)
Procs 2040M   8079.0000 (  0.00%)   7927.0000 (  1.88%)
Procs 2147M   8495.0000 (  0.00%)   8360.0000 (  1.59%)

   samples percentage
xfs  78334    1.3089  vmlinux-3.15.0-rc5-mmotm          page_waitqueue
xfs  55910    0.9342  vmlinux-3.15.0-rc5-mmotm          unlock_page
xfs  45120    0.7539  vmlinux-3.15.0-rc5-mmotm          __wake_up_bit
xfs  41414    0.6920  vmlinux-3.15.0-rc5-mmotm          test_clear_page_writeback
xfs   4823    0.0806  vmlinux-3.15.0-rc5-mmotm          end_page_writeback
xfs 100864    1.7063  vmlinux-3.15.0-rc5-lockpage    unlock_page
xfs  52547    0.8889  vmlinux-3.15.0-rc5-lockpage    test_clear_page_writeback
xfs   5031    0.0851  vmlinux-3.15.0-rc5-lockpage    end_page_writeback
xfs   1938    0.0328  vmlinux-3.15.0-rc5-lockpage    __wake_up_bit
xfs      9   1.5e-04  vmlinux-3.15.0-rc5-lockpage    __wake_up_page_bit
xfs      7   1.2e-04  vmlinux-3.15.0-rc5-lockpage    page_waitqueue

[jack@suse.cz: Fix add_page_wait_queue]
[mhocko@suse.cz: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@sgi.com: Do not update struct page unnecessarily]
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h |  18 ++++++
 include/linux/wait.h       |   8 +++
 kernel/sched/wait.c        | 145 +++++++++++++++++++++++++++++++++++----------
 mm/filemap.c               |  25 ++++----
 mm/page_alloc.c            |   1 +
 mm/swap.c                  |  12 ++++
 mm/vmscan.c                |   7 +++
 7 files changed, 173 insertions(+), 43 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7baf0fe..b697e4f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+	TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -509,6 +526,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..9226724 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -141,14 +141,21 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 	list_del(&old->task_list);
 }
 
+struct page;
+
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
 int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 void wake_up_bit(void *, int);
 void wake_up_atomic_t(atomic_t *);
 int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
@@ -822,6 +829,7 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void finish_wait_page(wait_queue_head_t *q, wait_queue_t *wait, struct page *page);
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..73cb8c6 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -167,31 +167,47 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * stops them from bleeding out - it would still allow subsequent
  * loads to move into the critical region).
  */
-void
-prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+static __always_inline void
+__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
+			struct page *page, int state, bool exclusive)
 {
 	unsigned long flags;
 
-	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue(q, wait);
+
+	/*
+	 * pages are hashed on a waitqueue that is expensive to lookup.
+	 * __wait_on_page_bit and __wait_on_page_bit_lock pass in a page
+	 * to set PG_waiters here. A PageWaiters() can then be used at
+	 * unlock time or when writeback completes to detect if there
+	 * are any potential waiters that justify a lookup.
+	 */
+	if (page && !PageWaiters(page))
+		SetPageWaiters(page);
+	if (list_empty(&wait->task_list)) {
+		if (exclusive) {
+			wait->flags |= WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue_tail(q, wait);
+		} else {
+			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue(q, wait);
+		}
+	}
 	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
+
+void
+prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+{
+	return __prepare_to_wait(q, wait, NULL, state, false);
+}
 EXPORT_SYMBOL(prepare_to_wait);
 
 void
 prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
-	unsigned long flags;
-
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue_tail(q, wait);
-	set_current_state(state);
-	spin_unlock_irqrestore(&q->lock, flags);
+	return __prepare_to_wait(q, wait, NULL, state, true);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
@@ -219,16 +235,8 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 }
 EXPORT_SYMBOL(prepare_to_wait_event);
 
-/**
- * finish_wait - clean up after waiting in a queue
- * @q: waitqueue waited on
- * @wait: wait descriptor
- *
- * Sets current thread back to running state and removes
- * the wait descriptor from the given waitqueue if still
- * queued.
- */
-void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+static __always_inline void __finish_wait(wait_queue_head_t *q,
+			wait_queue_t *wait, struct page *page)
 {
 	unsigned long flags;
 
@@ -249,9 +257,33 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 	if (!list_empty_careful(&wait->task_list)) {
 		spin_lock_irqsave(&q->lock, flags);
 		list_del_init(&wait->task_list);
+
+		/*
+		 * Clear PG_waiters if the waitqueue is no longer active. There
+		 * is no guarantee that a page with no waiters will get cleared
+		 * as there may be unrelated pages hashed to sleep on the same
+		 * queue. Accurate detection would require a counter but
+		 * collisions are expected to be rare.
+		 */
+		if (page && !waitqueue_active(q))
+			ClearPageWaiters(page);
 		spin_unlock_irqrestore(&q->lock, flags);
 	}
 }
+
+/**
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
+void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	return __finish_wait(q, wait, NULL);
+}
 EXPORT_SYMBOL(finish_wait);
 
 /**
@@ -313,24 +345,39 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 EXPORT_SYMBOL(wake_bit_function);
 
 /*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking)
- * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
- * permitted return codes. Nonzero return codes halt waiting and return.
+ * waits on a bit to be cleared (see wait_on_bit in wait.h for details).
+ * A page is optionally provided when used to wait on the PG_locked or
+ * PG_writeback bit. By setting PG_waiters a lookup of the waitqueue
+ * can be avoided during unlock_page or end_page_writeback.
  */
 int __sched
-__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	int ret = 0;
 
 	do {
-		prepare_to_wait(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, false);
 		if (test_bit(q->key.bit_nr, q->key.flags))
 			ret = (*action)(q->key.flags);
 	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return ret;
 }
+
+/*
+ * To allow interruptible waiting and asynchronous (i.e. nonblocking)
+ * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
+ * permitted return codes. Nonzero return codes halt waiting and return.
+ */
+int __sched
+__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit(wq, q, NULL, action, mode);
+}
+
 EXPORT_SYMBOL(__wait_on_bit);
 
 int __sched out_of_line_wait_on_bit(void *word, int bit,
@@ -344,13 +391,14 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
 EXPORT_SYMBOL(out_of_line_wait_on_bit);
 
 int __sched
-__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	do {
 		int ret;
 
-		prepare_to_wait_exclusive(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, true);
 		if (!test_bit(q->key.bit_nr, q->key.flags))
 			continue;
 		ret = action(q->key.flags);
@@ -359,9 +407,16 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
 		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return 0;
 }
+
+int __sched
+__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit_lock(wq, q, NULL, action, mode);
+}
 EXPORT_SYMBOL(__wait_on_bit_lock);
 
 int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
@@ -380,6 +435,32 @@ void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
 	if (waitqueue_active(wq))
 		__wake_up(wq, TASK_NORMAL, 1, &key);
 }
+
+void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
+{
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+	unsigned long flags;
+
+	/* If there is no PG_waiters bit, always take the slow path */
+	if (!__PG_WAITERS && waitqueue_active(wq)) {
+		__wake_up(wq, TASK_NORMAL, 1, &key);
+		return;
+	}
+
+	/*
+	 * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
+	 * checked under the wqh->lock to avoid races with parallel additions
+	 * to the waitqueue. Otherwise races could result in lost wakeups
+	 */
+	spin_lock_irqsave(&wqh->lock, flags);
+	if (waitqueue_active(wqh))
+		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
+	else
+		ClearPageWaiters(page);
+	spin_unlock_irqrestore(&wqh->lock, flags);
+}
+
+
 EXPORT_SYMBOL(__wake_up_bit);
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index 263cffe..07633a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -682,9 +682,9 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-static inline void wake_up_page(struct page *page, int bit)
+static inline void wake_up_page(struct page *page, int bit_nr)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	__wake_up_page_bit(page_waitqueue(page), page, &page->flags, bit_nr);
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
@@ -692,8 +692,8 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
 	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+		__wait_on_page_bit(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
@@ -704,7 +704,7 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
+	return __wait_on_page_bit(page_waitqueue(page), &wait, page,
 			     sleep_on_page_killable, TASK_KILLABLE);
 }
 
@@ -743,7 +743,8 @@ void unlock_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -769,7 +770,8 @@ void end_page_writeback(struct page *page)
 		BUG();
 
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -806,8 +808,8 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	__wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -815,9 +817,10 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	return __wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page_killable, TASK_KILLABLE);
 }
+
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1f005..ebb947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6603,6 +6603,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..1581dbf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+
+	/* See release_pages on why this clear may be necessary */
+	__ClearPageWaiters(page);
+
 	free_hot_cold_page(page, false);
 }
 
@@ -916,6 +920,14 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
 
+		/*
+		 * pages are hashed on a waitqueue so there may be collisions.
+		 * When waiters are woken the waitqueue is checked but
+		 * unrelated pages on the queue can leave the bit set. Clear
+		 * it here if that happens.
+		 */
+		__ClearPageWaiters(page);
+
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f85041..d7a4969 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,6 +1096,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * waiting on the page lock, because there are no references.
 		 */
 		__clear_page_locked(page);
+
+		/* See release_pages on why this clear may be necessary */
+		__ClearPageWaiters(page);
 free_it:
 		nr_reclaimed++;
 
@@ -1427,6 +1430,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
@@ -1650,6 +1655,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7
  2014-05-22 10:40                           ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7 Mel Gorman
@ 2014-05-22 10:56                             ` Peter Zijlstra
  2014-05-22 13:00                               ` Mel Gorman
  2014-05-22 14:40                               ` Mel Gorman
  0 siblings, 2 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-22 10:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Paul McKenney,
	Linus Torvalds, David Howells, Linux Kernel, Linux-MM,
	Linux-FSDevel


On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> +{
> +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> +	unsigned long flags;
> +
> +	/* If there is no PG_waiters bit, always take the slow path */

That comment is misleading, this is actually a fast path for
!PG_waiters.

> +	if (!__PG_WAITERS && waitqueue_active(wq)) {
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +		return;
> +	}
> +
> +	/*
> +	 * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> +	 * checked under the wqh->lock to avoid races with parallel additions
> +	 * to the waitqueue. Otherwise races could result in lost wakeups
> +	 */
> +	spin_lock_irqsave(&wqh->lock, flags);
> +	if (waitqueue_active(wqh))
> +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> +	else
> +		ClearPageWaiters(page);
> +	spin_unlock_irqrestore(&wqh->lock, flags);
> +}

So I think you missed one Clear opportunity here that was in my original
proposal, possibly because you also frobbed PG_writeback in.

If you do:

	spin_lock_irqsave(&wqh->lock, flags);
	if (!waitqueue_active(wqh) || !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
		ClearPageWaiters(page);
	spin_unlock_irqrestore(&wqh->lock, flags);

With the below change to __wake_up_common(), we'll also clear the bit
when there's no waiters of @page, even if there's waiters for another
page.

I suppose the one thing to say for the big open coded loop is that it's
much easier to read than this scattered stuff.

---
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20ae657b..213c5bfe6b56 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -61,18 +61,23 @@ EXPORT_SYMBOL(remove_wait_queue);
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, int wake_flags, void *key)
 {
 	wait_queue_t *curr, *next;
+	bool woke = false;
 
 	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
 		unsigned flags = curr->flags;
 
-		if (curr->func(curr, mode, wake_flags, key) &&
-				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
-			break;
+		if (curr->func(curr, mode, wake_flags, key)) {
+			woke = true;
+			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+				break;
+		}
 	}
+
+	return woke;
 }
 
 /**




^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7
  2014-05-22 10:56                             ` Peter Zijlstra
@ 2014-05-22 13:00                               ` Mel Gorman
  2014-05-22 14:40                               ` Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-22 13:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Paul McKenney,
	Linus Torvalds, David Howells, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Thu, May 22, 2014 at 12:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > +{
> > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +	unsigned long flags;
> > +
> > +	/* If there is no PG_waiters bit, always take the slow path */
> 
> That comment is misleading, this is actually a fast path for
> !PG_waiters.
> 

And could have been far better anyway now that you called me on it.

> > +	if (!__PG_WAITERS && waitqueue_active(wq)) {
> > +		__wake_up(wq, TASK_NORMAL, 1, &key);
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> > +	 * checked under the wqh->lock to avoid races with parallel additions
> > +	 * to the waitqueue. Otherwise races could result in lost wakeups
> > +	 */
> > +	spin_lock_irqsave(&wqh->lock, flags);
> > +	if (waitqueue_active(wqh))
> > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +	else
> > +		ClearPageWaiters(page);
> > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
> 
> So I think you missed one Clear opportunity here that was in my original
> proposal, possibly because you also frobbed PG_writeback in.
> 

It got lost in the midst of all the other modifications to make this as
"obvious" as possible.

> <SNIP>
>
> I suppose the one thing to say for the big open coded loop is that its
> much easier to read than this scattered stuff.
> 

Sure, but the end result of open coding this is duplicated code that will
be harder to maintain overall. I could split __wake_up_bit and use that
in both but I do not think it would make the code any clearer for the sake
of two lines.  Untested but this on top?

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 73cb8c6..d3a8c34 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -60,19 +60,26 @@ EXPORT_SYMBOL(remove_wait_queue);
  * There are circumstances in which we can try to wake a task which has already
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
+ *
+ * Returns true if a process was woken up
  */
-static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
+static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
 			int nr_exclusive, int wake_flags, void *key)
 {
 	wait_queue_t *curr, *next;
+	bool woke = false;
 
 	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
 		unsigned flags = curr->flags;
 
-		if (curr->func(curr, mode, wake_flags, key) &&
-				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
-			break;
+		if (curr->func(curr, mode, wake_flags, key)) {
+			woke = true;
+			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
+				break;
+		}
 	}
+
+	return woke;
 }
 
 /**
@@ -441,9 +448,13 @@ void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, i
 	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
 	unsigned long flags;
 
-	/* If there is no PG_waiters bit, always take the slow path */
-	if (!__PG_WAITERS && waitqueue_active(wq)) {
-		__wake_up(wq, TASK_NORMAL, 1, &key);
+	/*
+	 * If there is no PG_waiters bit (32-bit), then waitqueue_active can be
+	 * checked without wqh->lock as there is no PG_waiters race to protect.
+	 */
+	if (!__PG_WAITERS) {
+		if (waitqueue_active(wqh))
+			__wake_up(wqh, TASK_NORMAL, 1, &key);
 		return;
 	}
 
@@ -453,9 +464,8 @@ void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, i
 	 * to the waitqueue. Otherwise races could result in lost wakeups
 	 */
 	spin_lock_irqsave(&wqh->lock, flags);
-	if (waitqueue_active(wqh))
-		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
-	else
+	if (!waitqueue_active(wqh) ||
+	    !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
 		ClearPageWaiters(page);
 	spin_unlock_irqrestore(&wqh->lock, flags);
 }


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7
  2014-05-22 10:56                             ` Peter Zijlstra
  2014-05-22 13:00                               ` Mel Gorman
@ 2014-05-22 14:40                               ` Mel Gorman
  2014-05-22 15:04                                 ` Peter Zijlstra
  1 sibling, 1 reply; 103+ messages in thread
From: Mel Gorman @ 2014-05-22 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Paul McKenney,
	Linus Torvalds, David Howells, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Thu, May 22, 2014 at 12:56:38PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 11:40:51AM +0100, Mel Gorman wrote:
> > +void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
> > +{
> > +	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
> > +	unsigned long flags;
> > +
> > +	/* If there is no PG_waiters bit, always take the slow path */
> 
> That comment is misleading, this is actually a fast path for
> !PG_waiters.
> 
> > +	if (!__PG_WAITERS && waitqueue_active(wq)) {
> > +		__wake_up(wq, TASK_NORMAL, 1, &key);
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * Unlike __wake_up_bit it is necessary to check waitqueue_active to be
> > +	 * checked under the wqh->lock to avoid races with parallel additions
> > +	 * to the waitqueue. Otherwise races could result in lost wakeups
> > +	 */
> > +	spin_lock_irqsave(&wqh->lock, flags);
> > +	if (waitqueue_active(wqh))
> > +		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
> > +	else
> > +		ClearPageWaiters(page);
> > +	spin_unlock_irqrestore(&wqh->lock, flags);
> > +}
> 
> So I think you missed one Clear opportunity here that was in my original
> proposal, possibly because you also frobbed PG_writeback in.
> 
> If you do:
> 
> 	spin_lock_irqsave(&wqh->lock, flags);
> 	if (!waitqueue_active(wqh) || !__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key))
> 		ClearPageWaiters(page);
> 	spin_unlock_irqrestore(&wqh->lock, flags);
> 
> With the below change to __wake_up_common(), we'll also clear the bit
> when there's no waiters of @page, even if there's waiters for another
> page.
> 
> I suppose the one thing to say for the big open coded loop is that its
> much easier to read than this scattered stuff.
> 
> ---
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index 0ffa20ae657b..213c5bfe6b56 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -61,18 +61,23 @@ EXPORT_SYMBOL(remove_wait_queue);
>   * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
>   * zero in this (rare) case, and we handle it by continuing to scan the queue.
>   */
> -static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
>  			int nr_exclusive, int wake_flags, void *key)
>  {
>  	wait_queue_t *curr, *next;
> +	bool woke = false;
>  
>  	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
>  		unsigned flags = curr->flags;
>  
> -		if (curr->func(curr, mode, wake_flags, key) &&
> -				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> -			break;
> +		if (curr->func(curr, mode, wake_flags, key)) {
> +			woke = true;
> +			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> +				break;
> +		}
>  	}
> +
> +	return woke;

Ok, thinking about this more I'm less sure.

There are cases where the curr->func returns false even though there is a
task that needs to run -- task was already running or preparing to run. We
potentially end up clearing PG_waiters while there are still tasks on the
waitqueue. As __finish_wait checks if the waitqueue is empty and the last
waiter clears the bit I think there is nothing to gain by trying to do the
same job in __wake_up_page_bit.
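
For reference, for page-bit waiters curr->func is wake_bit_function(), which
is roughly the following (paraphrased from kernel/sched/wait.c rather than
quoted verbatim), and it is where the two cases get conflated:

int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
{
	struct wait_bit_key *key = arg;
	struct wait_bit_queue *wait_bit =
		container_of(wait, struct wait_bit_queue, wait);

	/* Different page/bit, or the bit got set again: nothing to wake */
	if (wait_bit->key.flags != key->flags ||
	    wait_bit->key.bit_nr != key->bit_nr ||
	    test_bit(key->bit_nr, key->flags))
		return 0;

	/*
	 * Passes through the try_to_wake_up() result, which is also 0 when
	 * the matching task was already runnable, so a 0 return does not
	 * mean the queue held no waiter for this page.
	 */
	return autoremove_wake_function(wait, mode, sync, key);
}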

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7
  2014-05-22 14:40                               ` Mel Gorman
@ 2014-05-22 15:04                                 ` Peter Zijlstra
  2014-05-22 15:36                                   ` Mel Gorman
  2014-05-22 16:58                                   ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v8 Mel Gorman
  0 siblings, 2 replies; 103+ messages in thread
From: Peter Zijlstra @ 2014-05-22 15:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Paul McKenney,
	Linus Torvalds, David Howells, Linux Kernel, Linux-MM,
	Linux-FSDevel


On Thu, May 22, 2014 at 03:40:45PM +0100, Mel Gorman wrote:

> > +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> >  			int nr_exclusive, int wake_flags, void *key)
> >  {
> >  	wait_queue_t *curr, *next;
> > +	bool woke = false;
> >  
> >  	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
> >  		unsigned flags = curr->flags;
> >  
> > +		if (curr->func(curr, mode, wake_flags, key)) {
> > +			woke = true;
> > +			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> > +				break;
> > +		}
> >  	}
> > +
> > +	return woke;
> 
> Ok, thinking about this more I'm less sure.
> 
> There are cases where the curr->func returns false even though there is a
> task that needs to run -- task was already running or preparing to run. We
> potentially end up clearing PG_waiters while there are still tasks on the
> waitqueue. As __finish_wait checks if the waitqueue is empty and the last
> waiter clears the bit I think there is nothing to gain by trying to do the
> same job in __wake_up_page_bit.

Hmm, I think you're right, we need the test result from
wake_bit_function(), unpolluted by the ttwu return value.


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7
  2014-05-22 15:04                                 ` Peter Zijlstra
@ 2014-05-22 15:36                                   ` Mel Gorman
  2014-05-22 16:58                                   ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v8 Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-22 15:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Paul McKenney,
	Linus Torvalds, David Howells, Linux Kernel, Linux-MM,
	Linux-FSDevel

On Thu, May 22, 2014 at 05:04:51PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2014 at 03:40:45PM +0100, Mel Gorman wrote:
> 
> > > +static bool __wake_up_common(wait_queue_head_t *q, unsigned int mode,
> > >  			int nr_exclusive, int wake_flags, void *key)
> > >  {
> > >  	wait_queue_t *curr, *next;
> > > +	bool woke = false;
> > >  
> > >  	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
> > >  		unsigned flags = curr->flags;
> > >  
> > > +		if (curr->func(curr, mode, wake_flags, key)) {
> > > +			woke = true;
> > > +			if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
> > > +				break;
> > > +		}
> > >  	}
> > > +
> > > +	return woke;
> > 
> > Ok, thinking about this more I'm less sure.
> > 
> > There are cases where the curr->func returns false even though there is a
> > task that needs to run -- task was already running or preparing to run. We
> > potentially end up clearing PG_waiters while there are still tasks on the
> > waitqueue. As __finish_wait checks if the waitqueue is empty and the last
> > waiter clears the bit I think there is nothing to gain by trying to do the
> > same job in __wake_up_page_bit.
> 
> Hmm, I think you're right, we need the test result from
> wake_bit_function(), unpolluted by the ttwu return value.

Which would be a bit too special cased and not a clear win. I at least
added a comment to explain what is going on here.

	/*
	 * Unlike __wake_up_bit it is necessary to check waitqueue_active
	 * under the wqh->lock to avoid races with parallel additions that
	 * could result in lost wakeups.
	 */
	spin_lock_irqsave(&wqh->lock, flags);
	if (waitqueue_active(wqh)) {
		/*
		 * Try waking a task on the queue. Responsibility for clearing
		 * the PG_waiters bit is left to the last waiter on the
		 * waitqueue as PageWaiters is called outside wqh->lock and
		 * we cannot miss wakeups. Due to hashqueue collisions, there
		 * may be colliding pages that still have PG_waiters set but
		 * the impact means there will be at least one unnecessary
		 * lookup of the page waitqueue on the next unlock_page or
		 * end of writeback.
		 */
		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
	} else {
		/* No potential waiters, safe to clear PG_waiters */
		ClearPageWaiters(page);
	}
	spin_unlock_irqrestore(&wqh->lock, flags);

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 103+ messages in thread

* [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v8
  2014-05-22 15:04                                 ` Peter Zijlstra
  2014-05-22 15:36                                   ` Mel Gorman
@ 2014-05-22 16:58                                   ` Mel Gorman
  1 sibling, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-22 16:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Paul McKenney,
	Linus Torvalds, David Howells, Linux Kernel, Linux-MM,
	Linux-FSDevel

Changelog since v7
o Further optimisation when PG_waiters is not available	(peterz)
o Catch all opportunities to ClearPageWaiters		(peterz)

Changelog since v6
o Optimisation when PG_waiters is not available		(peterz)
o Documentation

Changelog since v5
o __always_inline where appropriate			(peterz)
o Documentation						(akpm)

Changelog since v4
o Remove dependency on io_schedule_timeout
o Push waiting logic down into waitqueue

This patch introduces a new page flag for 64-bit capable machines,
PG_waiters, to signal there are *potentially* processes waiting on
PG_locked or PG_writeback.  If there are no possible waiters then we avoid
barriers, a waitqueue hash lookup and a failed wake_up in the unlock_page
and end_page_writeback paths. There is no guarantee that waiters exist if
PG_waiters is set as multiple pages can hash to the same waitqueue and we
cannot accurately detect if a waking process is the last waiter without
a reference count. When this happens, the bit is left set and a future
unlock or writeback completion will look up the waitqueue and clear the
bit when there are no collisions. This adds a few branches to the fast
path but avoids bouncing a dirty cache line between CPUs. 32-bit machines
always take the slow path but the primary motivation for this patch is
large machines so I do not think that is a concern.

The test case used to evaluate this is a simple dd of a large file done
multiple times with the file deleted on each iteration. The size of the
file is 1/10th of physical memory to avoid dirty page balancing. After each
dd there is a sync so the reported times do not vary much. By measuring
the time it takes to do the async dd, the impact of page_waitqueue overhead
on async IO is highlighted.

The test machine was single socket and UMA to avoid any scheduling or
NUMA artifacts. The performance results are reported based on a run with
no profiling.  Profile data is based on a separate run with oprofile running.

async dd
                                 3.15.0-rc5            3.15.0-rc5
                                      mmotm           lockpage-v8
btrfs Max      ddtime      0.5863 (  0.00%)      0.5593 (  4.61%)
ext3  Max      ddtime      1.4870 (  0.00%)      1.4609 (  1.76%)
ext4  Max      ddtime      1.0440 (  0.00%)      1.0376 (  0.61%)
tmpfs Max      ddtime      0.3541 (  0.00%)      0.3478 (  1.76%)
xfs   Max      ddtime      0.4995 (  0.00%)      0.4762 (  4.65%)

A separate run with profiles showed this

     samples percentage
ext3  225851    2.3180  vmlinux-3.15.0-rc5-mmotm       test_clear_page_writeback
ext3  106848    1.0966  vmlinux-3.15.0-rc5-mmotm       __wake_up_bit
ext3   71849    0.7374  vmlinux-3.15.0-rc5-mmotm       page_waitqueue
ext3   40319    0.4138  vmlinux-3.15.0-rc5-mmotm       unlock_page
ext3   26243    0.2693  vmlinux-3.15.0-rc5-mmotm       end_page_writeback
ext3  203718    2.1020  vmlinux-3.15.0-rc5-lockpage-v8 test_clear_page_writeback
ext3   64004    0.6604  vmlinux-3.15.0-rc5-lockpage-v8 unlock_page
ext3   24753    0.2554  vmlinux-3.15.0-rc5-lockpage-v8 end_page_writeback
ext3    8618    0.0889  vmlinux-3.15.0-rc5-lockpage-v8 __wake_up_bit
ext3    7247    0.0748  vmlinux-3.15.0-rc5-lockpage-v8 __wake_up_page_bit
ext3    2012    0.0208  vmlinux-3.15.0-rc5-lockpage-v8 page_waitqueue

The profiles show a clear reduction in the waitqueue and wakeup functions.
Note that end_page_writeback costs about the same; the savings there come
from reduced calls to __wake_up_bit and page_waitqueue, so there is no
obvious direct saving in that function itself. The cost of unlock_page is
higher as it now checks PageWaiters, but that is offset by the reduced
number of calls to page_waitqueue and __wake_up_bit. A similar story is
told for each of the filesystems. Note that for workloads that contend
heavily on the page lock, unlock_page may increase in cost as it has to
clear PG_waiters, so while the typical case should be much faster, the
worst-case costs are now higher.

This is also reflected in the time taken to mmap a range of pages.
These are the results for xfs only but the other filesystems tell a
similar story.

                       3.15.0-rc5            3.15.0-rc5
                            mmotm           lockpage-v8
Procs 107M     423.0000 (  0.00%)    409.0000 (  3.31%)
Procs 214M     847.0000 (  0.00%)    821.0000 (  3.07%)
Procs 322M    1296.0000 (  0.00%)   1232.0000 (  4.94%)
Procs 429M    1692.0000 (  0.00%)   1646.0000 (  2.72%)
Procs 536M    2137.0000 (  0.00%)   2052.0000 (  3.98%)
Procs 644M    2542.0000 (  0.00%)   2472.0000 (  2.75%)
Procs 751M    2953.0000 (  0.00%)   2871.0000 (  2.78%)
Procs 859M    3360.0000 (  0.00%)   3290.0000 (  2.08%)
Procs 966M    3770.0000 (  0.00%)   3678.0000 (  2.44%)
Procs 1073M   4220.0000 (  0.00%)   4101.0000 (  2.82%)
Procs 1181M   4638.0000 (  0.00%)   4518.0000 (  2.59%)
Procs 1288M   5038.0000 (  0.00%)   4934.0000 (  2.06%)
Procs 1395M   5481.0000 (  0.00%)   5344.0000 (  2.50%)
Procs 1503M   5940.0000 (  0.00%)   5764.0000 (  2.96%)
Procs 1610M   6316.0000 (  0.00%)   6186.0000 (  2.06%)
Procs 1717M   6749.0000 (  0.00%)   6595.0000 (  2.28%)
Procs 1825M   7323.0000 (  0.00%)   7034.0000 (  3.95%)
Procs 1932M   7694.0000 (  0.00%)   7461.0000 (  3.03%)
Procs 2040M   8079.0000 (  0.00%)   7837.0000 (  3.00%)
Procs 2147M   8495.0000 (  0.00%)   8351.0000 (  1.70%)

   samples percentage
xfs  78334    1.3089  vmlinux-3.15.0-rc5-mmotm          page_waitqueue
xfs  55910    0.9342  vmlinux-3.15.0-rc5-mmotm          unlock_page
xfs  45120    0.7539  vmlinux-3.15.0-rc5-mmotm          __wake_up_bit
xfs  41414    0.6920  vmlinux-3.15.0-rc5-mmotm          test_clear_page_writeback
xfs   4823    0.0806  vmlinux-3.15.0-rc5-mmotm          end_page_writeback
xfs 120504    2.0046  vmlinux-3.15.0-rc5-lockpage-v8    unlock_page
xfs  49179    0.8181  vmlinux-3.15.0-rc5-lockpage-v8    test_clear_page_writeback
xfs   5397    0.0898  vmlinux-3.15.0-rc5-lockpage-v8    end_page_writeback
xfs   2101    0.0350  vmlinux-3.15.0-rc5-lockpage-v8    __wake_up_bit
xfs      5   8.3e-05  vmlinux-3.15.0-rc5-lockpage-v8    page_waitqueue
xfs      4   6.7e-05  vmlinux-3.15.0-rc5-lockpage-v8    __wake_up_page_bit

[jack@suse.cz: Fix add_page_wait_queue]
[mhocko@suse.cz: Use sleep_on_page_killable in __wait_on_page_locked_killable]
[steiner@sgi.com: Do not update struct page unnecessarily]
[peterz@infradead.org: consolidate within wait.c, catch all ClearPageWaiters]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/page-flags.h |  18 +++++
 include/linux/wait.h       |   8 +++
 kernel/sched/wait.c        | 161 ++++++++++++++++++++++++++++++++++++---------
 mm/filemap.c               |  25 +++----
 mm/page_alloc.c            |   1 +
 mm/swap.c                  |  12 ++++
 mm/vmscan.c                |   7 ++
 7 files changed, 189 insertions(+), 43 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 7baf0fe..b697e4f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -87,6 +87,7 @@ enum pageflags {
 	PG_private_2,		/* If pagecache, has fs aux data */
 	PG_writeback,		/* Page is under writeback */
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	PG_waiters,		/* Page has PG_locked waiters. */
 	PG_head,		/* A head page */
 	PG_tail,		/* A tail page */
 #else
@@ -213,6 +214,22 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
 
 __PAGEFLAG(SlobFree, slob_free)
 
+#ifdef CONFIG_PAGEFLAGS_EXTENDED
+PAGEFLAG(Waiters, waiters) __CLEARPAGEFLAG(Waiters, waiters)
+	TESTCLEARFLAG(Waiters, waiters)
+#define __PG_WAITERS		(1 << PG_waiters)
+#else
+/* Always fallback to slow path on 32-bit */
+static inline bool PageWaiters(struct page *page)
+{
+	return true;
+}
+static inline void __ClearPageWaiters(struct page *page) {}
+static inline void ClearPageWaiters(struct page *page) {}
+static inline void SetPageWaiters(struct page *page) {}
+#define __PG_WAITERS		0
+#endif /* CONFIG_PAGEFLAGS_EXTENDED */
+
 /*
  * Private page markings that may be used by the filesystem that owns the page
  * for its own purposes.
@@ -509,6 +526,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
 	 1 << PG_writeback | 1 << PG_reserved | \
 	 1 << PG_slab	 | 1 << PG_swapcache | 1 << PG_active | \
 	 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
+	 __PG_WAITERS | \
 	 __PG_COMPOUND_LOCK)
 
 /*
diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..9226724 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -141,14 +141,21 @@ __remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
 	list_del(&old->task_list);
 }
 
+struct page;
+
 void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked_key(wait_queue_head_t *q, unsigned int mode, void *key);
 void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *key);
 void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
 void __wake_up_bit(wait_queue_head_t *, void *, int);
+void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
 int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 int __wait_on_bit_lock(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
+int __wait_on_page_bit_lock(wait_queue_head_t *, struct wait_bit_queue *,
+				struct page *page, int (*)(void *), unsigned);
 void wake_up_bit(void *, int);
 void wake_up_atomic_t(atomic_t *);
 int out_of_line_wait_on_bit(void *, int, int (*)(void *), unsigned);
@@ -822,6 +829,7 @@ void prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state);
 long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state);
 void finish_wait(wait_queue_head_t *q, wait_queue_t *wait);
+void finish_wait_page(wait_queue_head_t *q, wait_queue_t *wait, struct page *page);
 void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait, unsigned int mode, void *key);
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 0ffa20a..43e7df0 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -167,31 +167,47 @@ EXPORT_SYMBOL_GPL(__wake_up_sync);	/* For internal use only */
  * stops them from bleeding out - it would still allow subsequent
  * loads to move into the critical region).
  */
-void
-prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+static __always_inline void
+__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
+			struct page *page, int state, bool exclusive)
 {
 	unsigned long flags;
 
-	wait->flags &= ~WQ_FLAG_EXCLUSIVE;
 	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue(q, wait);
+
+	/*
+	 * pages are hashed on a waitqueue that is expensive to lookup.
+	 * __wait_on_page_bit and __wait_on_page_bit_lock pass in a page
+	 * to set PG_waiters here. A PageWaiters() can then be used at
+	 * unlock time or when writeback completes to detect if there
+	 * are any potential waiters that justify a lookup.
+	 */
+	if (page && !PageWaiters(page))
+		SetPageWaiters(page);
+	if (list_empty(&wait->task_list)) {
+		if (exclusive) {
+			wait->flags |= WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue_tail(q, wait);
+		} else {
+			wait->flags &= ~WQ_FLAG_EXCLUSIVE;
+			__add_wait_queue(q, wait);
+		}
+	}
 	set_current_state(state);
 	spin_unlock_irqrestore(&q->lock, flags);
 }
+
+void
+prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
+{
+	return __prepare_to_wait(q, wait, NULL, state, false);
+}
 EXPORT_SYMBOL(prepare_to_wait);
 
 void
 prepare_to_wait_exclusive(wait_queue_head_t *q, wait_queue_t *wait, int state)
 {
-	unsigned long flags;
-
-	wait->flags |= WQ_FLAG_EXCLUSIVE;
-	spin_lock_irqsave(&q->lock, flags);
-	if (list_empty(&wait->task_list))
-		__add_wait_queue_tail(q, wait);
-	set_current_state(state);
-	spin_unlock_irqrestore(&q->lock, flags);
+	return __prepare_to_wait(q, wait, NULL, state, true);
 }
 EXPORT_SYMBOL(prepare_to_wait_exclusive);
 
@@ -219,16 +235,8 @@ long prepare_to_wait_event(wait_queue_head_t *q, wait_queue_t *wait, int state)
 }
 EXPORT_SYMBOL(prepare_to_wait_event);
 
-/**
- * finish_wait - clean up after waiting in a queue
- * @q: waitqueue waited on
- * @wait: wait descriptor
- *
- * Sets current thread back to running state and removes
- * the wait descriptor from the given waitqueue if still
- * queued.
- */
-void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+static __always_inline void __finish_wait(wait_queue_head_t *q,
+			wait_queue_t *wait, struct page *page)
 {
 	unsigned long flags;
 
@@ -249,9 +257,33 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
 	if (!list_empty_careful(&wait->task_list)) {
 		spin_lock_irqsave(&q->lock, flags);
 		list_del_init(&wait->task_list);
+
+		/*
+		 * Clear PG_waiters if the waitqueue is no longer active. There
+		 * is no guarantee that a page with no waiters will get cleared
+		 * as there may be unrelated pages hashed to sleep on the same
+		 * queue. Accurate detection would require a counter but
+		 * collisions are expected to be rare.
+		 */
+		if (page && !waitqueue_active(q))
+			ClearPageWaiters(page);
 		spin_unlock_irqrestore(&q->lock, flags);
 	}
 }
+
+/**
+ * finish_wait - clean up after waiting in a queue
+ * @q: waitqueue waited on
+ * @wait: wait descriptor
+ *
+ * Sets current thread back to running state and removes
+ * the wait descriptor from the given waitqueue if still
+ * queued.
+ */
+void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
+{
+	return __finish_wait(q, wait, NULL);
+}
 EXPORT_SYMBOL(finish_wait);
 
 /**
@@ -313,24 +345,39 @@ int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *arg)
 EXPORT_SYMBOL(wake_bit_function);
 
 /*
- * To allow interruptible waiting and asynchronous (i.e. nonblocking)
- * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
- * permitted return codes. Nonzero return codes halt waiting and return.
+ * waits on a bit to be cleared (see wait_on_bit in wait.h for details).
+ * A page is optionally provided when used to wait on the PG_locked or
+ * PG_writeback bit. By setting PG_waiters a lookup of the waitqueue
+ * can be avoided during unlock_page or end_page_writeback.
  */
 int __sched
-__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	int ret = 0;
 
 	do {
-		prepare_to_wait(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, false);
 		if (test_bit(q->key.bit_nr, q->key.flags))
 			ret = (*action)(q->key.flags);
 	} while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return ret;
 }
+
+/*
+ * To allow interruptible waiting and asynchronous (i.e. nonblocking)
+ * waiting, the actions of __wait_on_bit() and __wait_on_bit_lock() are
+ * permitted return codes. Nonzero return codes halt waiting and return.
+ */
+int __sched
+__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit(wq, q, NULL, action, mode);
+}
+
 EXPORT_SYMBOL(__wait_on_bit);
 
 int __sched out_of_line_wait_on_bit(void *word, int bit,
@@ -344,13 +391,14 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
 EXPORT_SYMBOL(out_of_line_wait_on_bit);
 
 int __sched
-__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			struct page *page,
 			int (*action)(void *), unsigned mode)
 {
 	do {
 		int ret;
 
-		prepare_to_wait_exclusive(wq, &q->wait, mode);
+		__prepare_to_wait(wq, &q->wait, page, mode, true);
 		if (!test_bit(q->key.bit_nr, q->key.flags))
 			continue;
 		ret = action(q->key.flags);
@@ -359,9 +407,16 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
 		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
-	finish_wait(wq, &q->wait);
+	__finish_wait(wq, &q->wait, page);
 	return 0;
 }
+
+int __sched
+__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
+			int (*action)(void *), unsigned mode)
+{
+	return __wait_on_page_bit_lock(wq, q, NULL, action, mode);
+}
 EXPORT_SYMBOL(__wait_on_bit_lock);
 
 int __sched out_of_line_wait_on_bit_lock(void *word, int bit,
@@ -380,6 +435,48 @@ void __wake_up_bit(wait_queue_head_t *wq, void *word, int bit)
 	if (waitqueue_active(wq))
 		__wake_up(wq, TASK_NORMAL, 1, &key);
 }
+
+void __wake_up_page_bit(wait_queue_head_t *wqh, struct page *page, void *word, int bit)
+{
+	struct wait_bit_key key = __WAIT_BIT_KEY_INITIALIZER(word, bit);
+	unsigned long flags;
+
+	/*
+	 * If there is no PG_waiters bit (32-bit), then waitqueue_active can be
+	 * checked without wqh->lock as there is no PG_waiters race to protect.
+	 */
+	if (!__PG_WAITERS) {
+		if (waitqueue_active(wqh))
+			__wake_up(wqh, TASK_NORMAL, 1, &key);
+		return;
+	}
+
+	/*
+	 * Unlike __wake_up_bit it is necessary to check waitqueue_active
+	 * under the wqh->lock to avoid races with parallel additions that
+	 * could result in lost wakeups.
+	 */
+	spin_lock_irqsave(&wqh->lock, flags);
+	if (waitqueue_active(wqh)) {
+		/*
+		 * Try waking a task on the queue. Responsibility for clearing
+		 * the PG_waiters bit is left to the last waiter on the
+		 * waitqueue as PageWaiters is called outside wqh->lock and
+		 * we cannot miss wakeups. Due to hashqueue collisions, there
+		 * may be colliding pages that still have PG_waiters set but
+		 * the impact means there will be at least one unnecessary
+		 * lookup of the page waitqueue on the next unlock_page or
+		 * end of writeback.
+		 */
+		__wake_up_common(wqh, TASK_NORMAL, 1, 0, &key);
+	} else {
+		/* No potential waiters, safe to clear PG_waiters */
+		ClearPageWaiters(page);
+	}
+	spin_unlock_irqrestore(&wqh->lock, flags);
+}
+
+
 EXPORT_SYMBOL(__wake_up_bit);
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index 263cffe..07633a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -682,9 +682,9 @@ static wait_queue_head_t *page_waitqueue(struct page *page)
 	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
-static inline void wake_up_page(struct page *page, int bit)
+static inline void wake_up_page(struct page *page, int bit_nr)
 {
-	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
+	__wake_up_page_bit(page_waitqueue(page), page, &page->flags, bit_nr);
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
@@ -692,8 +692,8 @@ void wait_on_page_bit(struct page *page, int bit_nr)
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
 	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+		__wait_on_page_bit(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
 
@@ -704,7 +704,7 @@ int wait_on_page_bit_killable(struct page *page, int bit_nr)
 	if (!test_bit(bit_nr, &page->flags))
 		return 0;
 
-	return __wait_on_bit(page_waitqueue(page), &wait,
+	return __wait_on_page_bit(page_waitqueue(page), &wait, page,
 			     sleep_on_page_killable, TASK_KILLABLE);
 }
 
@@ -743,7 +743,8 @@ void unlock_page(struct page *page)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	clear_bit_unlock(PG_locked, &page->flags);
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_locked);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_locked);
 }
 EXPORT_SYMBOL(unlock_page);
 
@@ -769,7 +770,8 @@ void end_page_writeback(struct page *page)
 		BUG();
 
 	smp_mb__after_atomic();
-	wake_up_page(page, PG_writeback);
+	if (unlikely(PageWaiters(page)))
+		wake_up_page(page, PG_writeback);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
@@ -806,8 +808,8 @@ void __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
-							TASK_UNINTERRUPTIBLE);
+	__wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page, TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
 
@@ -815,9 +817,10 @@ int __lock_page_killable(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	return __wait_on_bit_lock(page_waitqueue(page), &wait,
-					sleep_on_page_killable, TASK_KILLABLE);
+	return __wait_on_page_bit_lock(page_waitqueue(page), &wait, page,
+					sleep_on_page_killable, TASK_KILLABLE);
 }
+
 EXPORT_SYMBOL_GPL(__lock_page_killable);
 
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cd1f005..ebb947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6603,6 +6603,7 @@ static const struct trace_print_flags pageflag_names[] = {
 	{1UL << PG_private_2,		"private_2"	},
 	{1UL << PG_writeback,		"writeback"	},
 #ifdef CONFIG_PAGEFLAGS_EXTENDED
+	{1UL << PG_waiters,		"waiters"	},
 	{1UL << PG_head,		"head"		},
 	{1UL << PG_tail,		"tail"		},
 #else
diff --git a/mm/swap.c b/mm/swap.c
index 9e8e347..1581dbf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
 static void __put_single_page(struct page *page)
 {
 	__page_cache_release(page);
+
+	/* See release_pages on why this clear may be necessary */
+	__ClearPageWaiters(page);
+
 	free_hot_cold_page(page, false);
 }
 
@@ -916,6 +920,14 @@ void release_pages(struct page **pages, int nr, bool cold)
 		/* Clear Active bit in case of parallel mark_page_accessed */
 		__ClearPageActive(page);
 
+		/*
+		 * pages are hashed on a waitqueue so there may be collisions.
+		 * When waiters are woken the waitqueue is checked but
+		 * unrelated pages on the queue can leave the bit set. Clear
+		 * it here if that happens.
+		 */
+		__ClearPageWaiters(page);
+
 		list_add(&page->lru, &pages_to_free);
 	}
 	if (zone)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f85041..d7a4969 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1096,6 +1096,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * waiting on the page lock, because there are no references.
 		 */
 		__clear_page_locked(page);
+
+		/* See release_pages on why this clear may be necessary */
+		__ClearPageWaiters(page);
 free_it:
 		nr_reclaimed++;
 
@@ -1427,6 +1430,8 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {
@@ -1650,6 +1655,8 @@ static void move_active_pages_to_lru(struct lruvec *lruvec,
 		if (put_page_testzero(page)) {
 			__ClearPageLRU(page);
 			__ClearPageActive(page);
+			/* See release_pages on why this clear may be necessary */
+			__ClearPageWaiters(page);
 			del_page_from_lru_list(page, lruvec, lru);
 
 			if (unlikely(PageCompound(page))) {

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
  2014-05-22  8:46                         ` Mel Gorman
@ 2014-05-22 17:47                           ` Andrew Morton
  2014-05-22 19:53                             ` Mel Gorman
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2014-05-22 17:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Thu, 22 May 2014 09:46:43 +0100 Mel Gorman <mgorman@suse.de> wrote:

> > > If I'm still on track here, what happens if we switch to wake-all so we
> > > can avoid the dangling flag?  I doubt if there are many collisions on
> > > that hash table?
> > 
> > Wake-all will be ugly and loose a herd of waiters, all racing to
> > acquire, all but one of whom will lose the race. It also loses the
> > fairness, it's currently a FIFO queue. Wake-all will allow starvation.
> > 
> 
> And the cost of the thundering herd of waiters may offset any benefit of
> reducing the number of calls to page_waitqueue and waker functions.

Well, none of this has been demonstrated.

As I speculated earlier, hash chain collisions will probably be rare,
except for the case where a bunch of processes are waiting on the same
page.  And in this case, perhaps wake-all is the desired behavior.

Take a look at do_read_cache_page().  It does lock_page(), but it
doesn't actually *need* to.  It checks ->mapping and PG_uptodate and
then...  unlocks the page!  We could have used wait_on_page_locked()
there and permitted concurrent threads to run concurrently.
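
Something like this minimal sketch of the found-in-cache path (untested and
only illustrative, keeping the existing lock_page() path as the fallback
for the not-uptodate case):

	page = find_get_page(mapping, index);
	if (page) {
		/* Wait out any in-flight read without taking PG_locked */
		wait_on_page_locked(page);
		if (PageUptodate(page))
			return page;
		/* Not uptodate: fall back to the current lock_page() path */
	}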

btw, I'm struggling a bit to understand why we bother checking
->mapping there as we're about to unlock the page anyway...


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 09/19] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-22  9:24   ` Vlastimil Babka
@ 2014-05-22 18:23     ` Andrew Morton
  2014-05-22 18:45       ` Vlastimil Babka
  0 siblings, 1 reply; 103+ messages in thread
From: Andrew Morton @ 2014-05-22 18:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Joonsoo Kim, Johannes Weiner, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On Thu, 22 May 2014 11:24:23 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> > In a test running dd onto tmpfs the overhead of the pageblock-related
> > functions went from 1.27% in profiles to 0.5%.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Hi, I've tested if this closes the race I've been previously trying to fix
> with the series in http://marc.info/?l=linux-mm&m=139359694028925&w=2
> And indeed with this patch I wasn't able to reproduce it in my stress test
> (which adds lots of memory isolation calls) anymore. So thanks to Mel I can
> dump my series in the trashcan :P
> 
> Therefore I believe something like below should be added to the changelog,
> and put to stable as well.

OK, I made it so.

Miraculously, the patch applies OK to 3.14.  And it compiles!

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [PATCH 09/19] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps
  2014-05-22 18:23     ` Andrew Morton
@ 2014-05-22 18:45       ` Vlastimil Babka
  0 siblings, 0 replies; 103+ messages in thread
From: Vlastimil Babka @ 2014-05-22 18:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Joonsoo Kim, Johannes Weiner, Jan Kara, Michal Hocko,
	Hugh Dickins, Peter Zijlstra, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel

On 22.5.2014 20:23, Andrew Morton wrote:
> On Thu, 22 May 2014 11:24:23 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
>
>>> In a test running dd onto tmpfs the overhead of the pageblock-related
>>> functions went from 1.27% in profiles to 0.5%.
>>>
>>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>> Hi, I've tested whether this closes the race I had previously been trying
>> to fix with the series in http://marc.info/?l=linux-mm&m=139359694028925&w=2
>> Indeed, with this patch I can no longer reproduce it in my stress test
>> (which adds lots of memory isolation calls). So thanks to Mel I can
>> dump my series in the trashcan :P
>>
>> Therefore I believe something like the text below should be added to the
>> changelog, and the patch sent to stable as well.
> OK, I made it so.

Thanks.

> Miraculously, the patch applies OK to 3.14.  And it compiles!

Great, shipping time!

Vlastimil


* Re: [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5
  2014-05-22 17:47                           ` Andrew Morton
@ 2014-05-22 19:53                             ` Mel Gorman
  0 siblings, 0 replies; 103+ messages in thread
From: Mel Gorman @ 2014-05-22 19:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Peter Zijlstra, Oleg Nesterov, Johannes Weiner, Vlastimil Babka,
	Jan Kara, Michal Hocko, Hugh Dickins, Dave Hansen, Linux Kernel,
	Linux-MM, Linux-FSDevel, Paul McKenney, Linus Torvalds,
	David Howells

On Thu, May 22, 2014 at 10:47:22AM -0700, Andrew Morton wrote:
> On Thu, 22 May 2014 09:46:43 +0100 Mel Gorman <mgorman@suse.de> wrote:
> 
> > > > If I'm still on track here, what happens if we switch to wake-all so we
> > > > can avoid the dangling flag?  I doubt if there are many collisions on
> > > > that hash table?
> > > 
> > > Wake-all will be ugly and will let loose a herd of waiters, all racing
> > > to acquire, all but one of whom will lose the race. It also loses the
> > > fairness; it's currently a FIFO queue. Wake-all will allow starvation.
> > > 
> > 
> > And the cost of the thundering herd of waiters may offset any benefit of
> > reducing the number of calls to page_waitqueue and waker functions.
> 
> Well, none of this has been demonstrated.
> 

True, but it's also the type of thing that would deserve a patch of its
own, kept separate in case bisection points the finger at a patch that
is doing too much at once.

> As I speculated earlier, hash chain collisions will probably be rare,

They are meant to be (well, they're documented to be). It's the primary
reason why I'm not concerned about "dangling waiters" being that common
a case.
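
For reference, the hashing in question looks roughly like this in the
3.15-era mm/filemap.c (reproduced from memory, so treat it as a sketch):

#include <linux/hash.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/wait.h>

/*
 * Roughly the 3.15-era page_waitqueue(): the page pointer is hashed
 * into the zone's wait table, so two different pages share a queue
 * (and hence a "dangling waiter" could matter) only on a hash
 * collision.
 */
static wait_queue_head_t *page_waitqueue(struct page *page)
{
	const struct zone *zone = page_zone(page);

	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
}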

> except for the case where a bunch of processes are waiting on the same
> page.  And in this case, perhaps wake-all is the desired behavior.
> 
> Take a look at do_read_cache_page().  It does lock_page(), but it
> doesn't actually *need* to.  It checks ->mapping and PG_uptodate and
> then...  unlocks the page!  We could have used wait_on_page_locked()
> there and permitted multiple readers to proceed concurrently.
> 

It does that later when it calls wait_on_page_read, but the flow is weird.
It looks like the first lock_page was there to serialise against any IO and
to double-check that it was not racing against a parallel reclaim, although
the elevated reference count should have prevented that. A historical
artifact, maybe? It looks like there is room for improvement there, but that
too would deserve a patch of its own.
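
To illustrate the direction being suggested, here is a purely hypothetical
sketch of the idea rather than a patch; the name is made up and the
filler/retry handling is elided.  Waiting on PG_locked without ever taking
it keeps concurrent readers of an already-populated page off the lock
entirely.

#include <linux/err.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Hypothetical lockless variant of the sequence discussed above: sleep
 * until whoever holds PG_locked drops it, then re-check the page state.
 * Because the lock is never acquired, multiple readers of the same page
 * do not serialise behind each other.
 */
static struct page *wait_until_uptodate_lockless(struct page *page)
{
	if (PageUptodate(page))
		return page;

	wait_on_page_locked(page);		/* wait, but do not take PG_locked */
	if (PageUptodate(page))
		return page;
	if (!page->mapping) {			/* truncated while we slept */
		page_cache_release(page);
		return NULL;			/* caller repeats the lookup */
	}
	return ERR_PTR(-EIO);			/* simplified: real code would retry the filler */
}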

-- 
Mel Gorman
SUSE Labs


end of thread, newest message: 2014-05-22 19:53 UTC

Thread overview: 103+ messages
2014-05-13  9:45 [PATCH 00/19] Misc page alloc, shmem, mark_page_accessed and page_waitqueue optimisations v3r33 Mel Gorman
2014-05-13  9:45 ` [PATCH 01/19] mm: page_alloc: Do not update zlc unless the zlc is active Mel Gorman
2014-05-13  9:45 ` [PATCH 02/19] mm: page_alloc: Do not treat a zone that cannot be used for dirty pages as "full" Mel Gorman
2014-05-13  9:45 ` [PATCH 03/19] jump_label: Expose the reference count Mel Gorman
2014-05-13  9:45 ` [PATCH 04/19] mm: page_alloc: Use jump labels to avoid checking number_of_cpusets Mel Gorman
2014-05-13 10:58   ` Peter Zijlstra
2014-05-13 12:28     ` Mel Gorman
2014-05-13  9:45 ` [PATCH 05/19] mm: page_alloc: Calculate classzone_idx once from the zonelist ref Mel Gorman
2014-05-13 22:25   ` Andrew Morton
2014-05-14  6:32     ` Mel Gorman
2014-05-14 20:29     ` Mel Gorman
2014-05-13  9:45 ` [PATCH 06/19] mm: page_alloc: Only check the zone id check if pages are buddies Mel Gorman
2014-05-13  9:45 ` [PATCH 07/19] mm: page_alloc: Only check the alloc flags and gfp_mask for dirty once Mel Gorman
2014-05-13  9:45 ` [PATCH 08/19] mm: page_alloc: Take the ALLOC_NO_WATERMARK check out of the fast path Mel Gorman
2014-05-13  9:45 ` [PATCH 09/19] mm: page_alloc: Use word-based accesses for get/set pageblock bitmaps Mel Gorman
2014-05-22  9:24   ` Vlastimil Babka
2014-05-22 18:23     ` Andrew Morton
2014-05-22 18:45       ` Vlastimil Babka
2014-05-13  9:45 ` [PATCH 10/19] mm: page_alloc: Reduce number of times page_to_pfn is called Mel Gorman
2014-05-13 13:27   ` Vlastimil Babka
2014-05-13 14:09     ` Mel Gorman
2014-05-13  9:45 ` [PATCH 11/19] mm: page_alloc: Lookup pageblock migratetype with IRQs enabled during free Mel Gorman
2014-05-13 13:36   ` Vlastimil Babka
2014-05-13 14:23     ` Mel Gorman
2014-05-13  9:45 ` [PATCH 12/19] mm: page_alloc: Use unsigned int for order in more places Mel Gorman
2014-05-13  9:45 ` [PATCH 13/19] mm: page_alloc: Convert hot/cold parameter and immediate callers to bool Mel Gorman
2014-05-13  9:45 ` [PATCH 14/19] mm: shmem: Avoid atomic operation during shmem_getpage_gfp Mel Gorman
2014-05-13  9:45 ` [PATCH 15/19] mm: Do not use atomic operations when releasing pages Mel Gorman
2014-05-13  9:45 ` [PATCH 16/19] mm: Do not use unnecessary atomic operations when adding pages to the LRU Mel Gorman
2014-05-13  9:45 ` [PATCH 17/19] fs: buffer: Do not use unnecessary atomic operations when discarding buffers Mel Gorman
2014-05-13 11:09   ` Peter Zijlstra
2014-05-13 12:50     ` Mel Gorman
2014-05-13 13:49       ` Jan Kara
2014-05-13 14:30         ` Mel Gorman
2014-05-13 14:01       ` Peter Zijlstra
2014-05-13 14:46         ` Mel Gorman
2014-05-13 13:50   ` Jan Kara
2014-05-13 22:29   ` Andrew Morton
2014-05-14  6:12     ` Mel Gorman
2014-05-13  9:45 ` [PATCH 18/19] mm: Non-atomically mark page accessed during page cache allocation where possible Mel Gorman
2014-05-13 14:29   ` Theodore Ts'o
2014-05-20 15:49   ` [PATCH] mm: non-atomically mark page accessed during page cache allocation where possible -fix Mel Gorman
2014-05-20 19:34     ` Andrew Morton
2014-05-21 12:09       ` Mel Gorman
2014-05-21 22:11         ` Andrew Morton
2014-05-22  0:07           ` Mel Gorman
2014-05-22  5:35       ` Prabhakar Lad
2014-05-13  9:45 ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Mel Gorman
2014-05-13 12:53   ` Mel Gorman
2014-05-13 14:17     ` Peter Zijlstra
2014-05-13 15:27       ` Paul E. McKenney
2014-05-13 15:44         ` Peter Zijlstra
2014-05-13 16:14           ` Paul E. McKenney
2014-05-13 18:57             ` Oleg Nesterov
2014-05-13 20:24               ` Paul E. McKenney
2014-05-14 14:25                 ` Oleg Nesterov
2014-05-13 18:22           ` Oleg Nesterov
2014-05-13 18:18         ` Oleg Nesterov
2014-05-13 18:24           ` Peter Zijlstra
2014-05-13 18:52           ` Paul E. McKenney
2014-05-13 19:31             ` Oleg Nesterov
2014-05-13 20:32               ` Paul E. McKenney
2014-05-14 16:11       ` Oleg Nesterov
2014-05-14 16:17         ` Peter Zijlstra
2014-05-16 13:51           ` [PATCH 0/1] ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb() Oleg Nesterov
2014-05-16 13:51             ` [PATCH 1/1] " Oleg Nesterov
2014-05-21  9:29               ` Peter Zijlstra
2014-05-21 19:19                 ` Andrew Morton
2014-05-21 19:18             ` [PATCH 0/1] " Andrew Morton
2014-05-14 19:29         ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Oleg Nesterov
2014-05-14 20:53           ` Mel Gorman
2014-05-15 10:48           ` [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v4 Mel Gorman
2014-05-15 13:20             ` Peter Zijlstra
2014-05-15 13:29               ` Peter Zijlstra
2014-05-15 15:34               ` Oleg Nesterov
2014-05-15 15:45                 ` Peter Zijlstra
2014-05-15 16:18               ` Mel Gorman
2014-05-15 15:03             ` Oleg Nesterov
2014-05-15 21:24             ` Andrew Morton
2014-05-21 12:15               ` [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5 Mel Gorman
2014-05-21 13:02                 ` Peter Zijlstra
2014-05-21 15:33                   ` Mel Gorman
2014-05-21 16:08                     ` Peter Zijlstra
2014-05-21 21:26                 ` Andrew Morton
2014-05-21 21:33                   ` Peter Zijlstra
2014-05-21 21:50                     ` Andrew Morton
2014-05-22  0:07                       ` Mel Gorman
2014-05-22  7:20                         ` Peter Zijlstra
2014-05-22 10:40                           ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v7 Mel Gorman
2014-05-22 10:56                             ` Peter Zijlstra
2014-05-22 13:00                               ` Mel Gorman
2014-05-22 14:40                               ` Mel Gorman
2014-05-22 15:04                                 ` Peter Zijlstra
2014-05-22 15:36                                   ` Mel Gorman
2014-05-22 16:58                                   ` [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v8 Mel Gorman
2014-05-22  6:45                       ` [PATCH] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath v5 Peter Zijlstra
2014-05-22  8:46                         ` Mel Gorman
2014-05-22 17:47                           ` Andrew Morton
2014-05-22 19:53                             ` Mel Gorman
2014-05-21 23:35                   ` Mel Gorman
2014-05-13 16:52   ` [PATCH 19/19] mm: filemap: Avoid unnecessary barries and waitqueue lookups in unlock_page fastpath Peter Zijlstra
2014-05-14  7:31     ` Mel Gorman
2014-05-19  8:57 ` [PATCH] mm: Avoid unnecessary atomic operations during end_page_writeback Mel Gorman
